sync: Completely parallelize the initial payload #136
Conversation
/test e2e-aws

1 similar comment

/test e2e-aws
Force-pushed from d3f6d5a to 2fb95a7.
Console continues to take the longest: the last run was 11m overall, monitoring finished at 8m30s, and console took just about 5m total.
/test e2e-aws

1 similar comment

/test e2e-aws
Ok, looks like parallel installation was able to break at least openshift-apiserver: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/459

/retest
/retest

2 similar comments

/retest

/retest

/test e2e-aws
So I've seen one break out of 6 runs, which is about the same rate as mainline. I think this is ready for review @abhinavdahiya
Force-pushed from 2fb95a7 to 01ac002.
Increase parallelism to the maximum number of tasks, increase backoff, wait longer for cluster operators, and generally try to load everything at the same time.
Initialization can be bursty because we don't have any competing workloads.
If the operator reports level and is available, we're good to go.
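A minimal sketch of that gate, with hypothetical local types standing in for the openshift/api ClusterOperator structs (not the CVO's actual code): during initialization, an operator that reports the expected version and Available=True is treated as done, without waiting out a Failing condition.

```go
package main

import "fmt"

// Illustrative stand-ins for the ClusterOperator status types; the real
// definitions live in github.com/openshift/api/config/v1.
type condition struct {
	Type   string // e.g. "Available", "Failing"
	Status string // "True", "False", "Unknown"
}

type clusterOperatorStatus struct {
	Version    string
	Conditions []condition
}

// leveled reports whether an operator can be considered done during the
// initial payload sync: it is at the expected version and Available=True.
// Note what it does NOT check: a Failing=True condition no longer blocks
// initialization.
func leveled(status clusterOperatorStatus, expectedVersion string) bool {
	if status.Version != expectedVersion {
		return false
	}
	for _, c := range status.Conditions {
		if c.Type == "Available" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	status := clusterOperatorStatus{
		Version: "4.0.0",
		Conditions: []condition{
			{Type: "Available", Status: "True"},
			{Type: "Failing", Status: "True"}, // tolerated during install
		},
	}
	fmt.Println(leveled(status, "4.0.0")) // true: available and at level
}
```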
Force-pushed from 01ac002 to b0b4902.
/retest
    },
},
{
    tasks: tasks("0000_01_x-y-z_file1", "0000_01_x-y-z_file2"),
This is the same as fee2d06#diff-6c3be3194acbe263715006cdee7e9442R276: multiple grouped items in a single node.
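For context on what the test asserts, here is a hedged sketch of the grouping idea: manifests whose filenames share a run-level prefix (the `0000_01` part) land together in a single task node. The `taskNode` type and `runLevel` helper are illustrative, not the repo's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// taskNode is an illustrative stand-in for the payload graph node type:
// all tasks in one node are applied as a group.
type taskNode struct {
	tasks []string
}

// runLevel extracts the "0000_01"-style prefix from a manifest filename;
// manifests sharing it belong in the same group.
func runLevel(name string) string {
	parts := strings.SplitN(name, "_", 3)
	if len(parts) < 3 {
		return ""
	}
	return parts[0] + "_" + parts[1]
}

// groupByRunLevel collapses consecutive manifests with the same run level
// into one node, keeping distinct run levels in separate nodes.
func groupByRunLevel(manifests []string) []taskNode {
	var nodes []taskNode
	for _, m := range manifests {
		n := len(nodes)
		if n > 0 && runLevel(nodes[n-1].tasks[0]) == runLevel(m) {
			nodes[n-1].tasks = append(nodes[n-1].tasks, m) // same node
			continue
		}
		nodes = append(nodes, taskNode{tasks: []string{m}})
	}
	return nodes
}

func main() {
	nodes := groupByRunLevel([]string{
		"0000_01_x-y-z_file1", "0000_01_x-y-z_file2", "0000_02_x-y-z_file1",
	})
	for _, n := range nodes {
		fmt.Println(n.tasks) // both 0000_01 files in one node, 0000_02 alone
	}
}
```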
This looks LGTM.

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/468/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T15:39:00Z",
  "completed": "2019-03-13T15:50:10Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/464/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T00:24:35Z",
  "completed": "2019-03-13T00:35:22Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/458/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-12T05:04:14Z",
  "completed": "2019-03-12T05:18:23Z"
}
```

On average, 11 minutes to level. Compared to CI runs on the installer (without this change):

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1407/pull-ci-openshift-installer-master-e2e-aws/4464/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-12T19:35:32Z",
  "completed": "2019-03-12T19:54:15Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1407/pull-ci-openshift-installer-master-e2e-aws/4475/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T15:29:54Z",
  "completed": "2019-03-13T15:48:44Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1393/pull-ci-openshift-installer-master-e2e-aws/4470/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T02:55:20Z",
  "completed": "2019-03-13T03:13:48Z"
}
```

So on average 19 minutes to level. This adds a significant speedup. 💯

/approve
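Purely for illustration (not part of the PR), the averages quoted above can be reproduced from those timestamps with a few lines of Go:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// start/completion pairs copied from the clusterversion.json outputs above
	runs := [][2]string{
		{"2019-03-13T15:39:00Z", "2019-03-13T15:50:10Z"}, // CVO PR run 468
		{"2019-03-13T00:24:35Z", "2019-03-13T00:35:22Z"}, // CVO PR run 464
		{"2019-03-12T05:04:14Z", "2019-03-12T05:18:23Z"}, // CVO PR run 458
	}
	var total time.Duration
	for _, r := range runs {
		start, _ := time.Parse(time.RFC3339, r[0])
		end, _ := time.Parse(time.RFC3339, r[1])
		total += end.Sub(start)
	}
	// Prints roughly 12m for these runs; the same math over the installer
	// baseline runs above gives a bit under 19m.
	fmt.Println("average time to level:", total/time.Duration(len(runs)))
}
```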
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
As long as the ClusterOperator is Available=True and setting the expected versions, we should be able to move forward. We'd softened our install-time handling to act this way back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136) when the type was Failing. The motivation there had been install speed [1]. And a degraded operator may slow dependent components in their own transitions. But as long as the operator/operand are available at all, it should not block dependent components from transitioning.

We still have the critical ClusterOperatorDegraded alert waking admins up when an operator goes Degraded=True for a while; we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperator tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
We have been blocking on this condition since 545c342 (api: make status substruct on operatorstatus, 2018-10-15, openshift#31), when it was Failing. We'd softened our install-time handling to act this way back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136), motivated by install speed [1]. And a degraded operator may slow dependent components in their own transitions. But as long as the operator/operand are available at all, it should not block dependent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded alert waking admins up when an operator goes Degraded=True for a while; we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperator tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast as possible without blocking on "has the in-cluster resource leveled?" since way back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136). That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate commit. For ClusterOperatorDegraded, the degraded condition should not be particularly urgent [4], so we should be fine bumping it to 'warning' and using 'for: 30m' or something more relaxed than the current 10m.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776 group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
Things like API-server connection errors and patch conflicts deserve some retries before we bubble them up into ClusterVersion conditions. But when we are able to retrieve in-cluster objects and determine that they are not happy, we should exit more quickly so we can complain about the resource state and start in on the next sync cycle. For example, see the recent e02d148 (pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560) and the older cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-07, openshift#400).

This commit uses the presence of an UpdateError as a marker for "fail fast; no need to retry". The install-time backoff is from fee2d06 (sync: Completely parallelize the initial payload, 2019-03-11, openshift#136). I'm not sure if it really wants the same cap as reconcile and update modes, but I've left them the same for now, and future commits to pivot the backoff settings can focus on motivating those pivots.
Things like API-server connection errors and patch conflicts deserve some retries before we bubble them up into ClusterVersion conditions. But when we are able to retrieve in-cluster objects and determine that they are not happy, we should exit more quickly so we can complain about the resource state and start in on the next sync cycle. For example, see the recent e02d148 (pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560) and the older cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-07, openshift#400).

This commit uses the presence of an UpdateError as a marker for "fail fast; no need to retry". The install-time backoff is from fee2d06 (sync: Completely parallelize the initial payload, 2019-03-11, openshift#136). I'm not sure if it really wants the same cap as reconcile and update modes, but I've left them the same for now, and future commits to pivot the backoff settings can focus on motivating those pivots.

I'd tried dropping:

    backoff := st.Backoff

and passing st.Backoff directly to ExponentialBackoffWithContext, but it turns out that Step() [1]:

    ... mutates the provided Backoff to update its Steps and Duration.

Luckily, Backoff has no pointer properties, so storing as a local variable is sufficient to give us a fresh copy for the local mutations.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
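A hedged sketch of the pattern that commit message describes (not the CVO's actual code): copy the wait.Backoff template before use, because Step() mutates its receiver, and treat an UpdateError-style sentinel as terminal rather than retriable. The `updateError` type and `apply` callback here are hypothetical stand-ins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// updateError is a hypothetical stand-in for the CVO's UpdateError: its
// presence means "the resource itself is unhappy; retrying won't help".
type updateError struct{ msg string }

func (e *updateError) Error() string { return e.msg }

// applyWithBackoff retries transient failures of apply, but fails fast on
// updateError. template is copied on purpose: Backoff.Step() mutates its
// receiver (decrementing Steps and growing Duration), and since Backoff has
// no pointer fields, a plain value copy keeps the caller's template pristine.
func applyWithBackoff(ctx context.Context, template wait.Backoff, apply func(context.Context) error) error {
	backoff := template // fresh copy for the local Step() mutations
	var lastErr error
	for backoff.Steps > 0 {
		if lastErr = apply(ctx); lastErr == nil {
			return nil
		}
		var uerr *updateError
		if errors.As(lastErr, &uerr) {
			return lastErr // terminal: surface immediately, no retry
		}
		select {
		case <-time.After(backoff.Step()):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return lastErr
}

func main() {
	template := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2, Steps: 4}
	err := applyWithBackoff(context.Background(), template, func(context.Context) error {
		return &updateError{msg: "ClusterOperator reports Degraded=True"}
	})
	fmt.Println(err)                 // fails fast on the first attempt
	fmt.Println(template.Steps == 4) // true: the caller's template is untouched
}
```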
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast as possible without blocking on "has the in-cluster resource leveled?" since way back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136). That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate commit. For ClusterOperatorDegraded, the degraded condition should not be particularly urgent [4], so we should be fine bumping it to 'warning' and using 'for: 30m' or something more relaxed than the current 10m.

This commit brings back:

* fb5257d (install/0000_90_cluster-version-operator_02_servicemonitor: Soften ClusterOperatorDegraded, 2021-05-06, openshift#554) and
* 92ed7f1 (install/0000_90_cluster-version-operator_02_servicemonitor: Update ClusterOperatorDegraded message to 30m, 2021-05-08, openshift#556).

There are some conflicts, because I am not bringing back 90539f9 (pkg/cvo/metrics: Ignore Degraded for cluster_operator_up, 2021-04-26, openshift#550). But that one had its own conflicts in metrics.go [5], and the conflicts with this commit were orthogonal context issues, so moving this back to 4.7 first won't make it much harder to bring back openshift#550 and such later on, if we decide to do that.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776 group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
[5]: openshift#550 (comment)
We have been blocking on this condition since 545c342 (api: make status substruct on operatorstatus, 2018-10-15, openshift#31), when it was Failing. We'd softened our install-time handling to act this way back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136), motivated by install speed [1]. And a degraded operator may slow dependent components in their own transitions. But as long as the operator/operand are available at all, it should not block dependent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the warning ClusterOperatorDegraded alert notifying admins when an operator goes Degraded=True for a while; we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperator tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
Increase parallelism to the maximum number of tasks, increase backoff, wait longer for cluster operators, and generally try to load everything at the same time.
This results in a significant reduction in boot time and allows us to keep run-levels in the payload in the desired update order, not the desired install order.
Builds on top of #133
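A minimal sketch of the "run everything at once" idea during initialization (illustrative only, under the assumption of one worker per node; the real implementation is the payload task-graph runner): every task node gets its own goroutine, rather than the small fixed worker pool used during reconciliation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// applyNode is a hypothetical stand-in for applying one task node's manifests.
func applyNode(ctx context.Context, node string) error {
	fmt.Println("applying", node)
	return nil
}

// runAllAtOnce starts one goroutine per node: during initialization there are
// no competing workloads, so the burst is acceptable and parallelism is
// effectively "maximum tasks".
func runAllAtOnce(ctx context.Context, nodes []string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(nodes))
	for _, node := range nodes {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			if err := applyNode(ctx, node); err != nil {
				errs <- err
			}
		}(node)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err // report the first failure; retries happen on the next sync
	}
	return nil
}

func main() {
	nodes := []string{"0000_01_operators", "0000_50_console", "0000_70_monitoring"}
	if err := runAllAtOnce(context.Background(), nodes); err != nil {
		fmt.Println("initial payload sync failed:", err)
	}
}
```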