
sync: Completely parallelize the initial payload #136


Merged
merged 3 commits into openshift:master from fast_init
Mar 13, 2019

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Mar 11, 2019

Increase parallelism to the maximum number of tasks, increase backoff, wait longer for cluster operators, and generally try to load everything at the same time.

This results in a significant reduction in boot time and allows us to keep run-levels in the payload in the desired update order, not the desired install order.

Builds on top of #133
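
Roughly, "completely parallelize" means pushing every manifest task through one bounded worker pool at the same time instead of walking run-levels serially. A minimal sketch of that pattern, purely for illustration (this is not the CVO's actual task-graph code, and the names are made up):

package sketch

import (
    "context"
    "sync"
)

// runAll starts every task at once, bounded by maxParallelism workers.
func runAll(ctx context.Context, tasks []func(context.Context) error, maxParallelism int) error {
    sem := make(chan struct{}, maxParallelism) // limits how many tasks run concurrently
    errs := make(chan error, len(tasks))
    var wg sync.WaitGroup
    for _, task := range tasks {
        wg.Add(1)
        go func(task func(context.Context) error) {
            defer wg.Done()
            sem <- struct{}{} // acquire a worker slot
            defer func() { <-sem }()
            errs <- task(ctx)
        }(task)
    }
    wg.Wait()
    close(errs)
    for err := range errs {
        if err != nil {
            return err // surface the first failure; per-task retry/backoff omitted
        }
    }
    return nil
}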

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 11, 2019
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 11, 2019
@smarterclayton
Contributor Author

/test e2e-aws

1 similar comment
@smarterclayton
Contributor Author

/test e2e-aws

@smarterclayton smarterclayton force-pushed the fast_init branch 2 times, most recently from d3f6d5a to 2fb95a7 on March 12, 2019 03:21
@smarterclayton
Contributor Author

Console continues to take the longest: the last run was 11m total, monitoring finished at 8m30s, and console took just about 5m on its own.

@smarterclayton
Contributor Author

/test e2e-aws

1 similar comment
@smarterclayton
Contributor Author

/test e2e-aws


@smarterclayton
Contributor Author

/retest

2 similar comments
@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/test e2e-aws

@smarterclayton
Contributor Author

So I've seen one failure out of 6 runs, which is about what mainline sees. I think this is ready for review @abhinavdahiya

@smarterclayton smarterclayton changed the title WIP: sync: Completely parallelize the initial payload sync: Completely parallelize the initial payload Mar 13, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2019
Increase parallelism to the maximum number of tasks, increase backoff, wait longer
for cluster operators, and generally try to load everything at the
same time.
Initialization can be bursty because we don't have any competing
workloads.
If the operator reports level and is available, we're good to go.
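A rough sketch of the "reports level and is available" check described above, using the openshift/api ClusterOperator types (illustrative only; the function name and shape are not the CVO's exact code):

package sketch

import (
    configv1 "github.com/openshift/api/config/v1"
)

// reportsLevelAndAvailable returns true once the ClusterOperator lists every
// expected operand version and reports Available=True.
func reportsLevelAndAvailable(co *configv1.ClusterOperator, want []configv1.OperandVersion) bool {
    for _, w := range want {
        found := false
        for _, have := range co.Status.Versions {
            if have.Name == w.Name && have.Version == w.Version {
                found = true
                break
            }
        }
        if !found {
            return false // still rolling out the expected version
        }
    }
    for _, c := range co.Status.Conditions {
        if c.Type == configv1.OperatorAvailable {
            return c.Status == configv1.ConditionTrue
        }
    }
    return false // no Available condition reported yet
}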
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 13, 2019
@smarterclayton
Contributor Author

/retest

},
},
{
tasks: tasks("0000_01_x-y-z_file1", "0000_01_x-y-z_file2"),
Contributor

This is the same as fee2d06#diff-6c3be3194acbe263715006cdee7e9442R276: multiple grouped items in a single node.

@abhinavdahiya
Contributor

This looks good to me.
Trying to see how much of a speed-up we get.
From this PR:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/468/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T15:39:00Z",
  "completed": "2019-03-13T15:50:10Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/464/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T00:24:35Z",
  "completed": "2019-03-13T00:35:22Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/458/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-12T05:04:14Z",
  "completed": "2019-03-12T05:18:23Z"
}

On average, about 11 minutes to level.

Compared to CI runs on the installer (without this change):

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1407/pull-ci-openshift-installer-master-e2e-aws/4464/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-12T19:35:32Z",
  "completed": "2019-03-12T19:54:15Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1407/pull-ci-openshift-installer-master-e2e-aws/4475/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T15:29:54Z",
  "completed": "2019-03-13T15:48:44Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1393/pull-ci-openshift-installer-master-e2e-aws/4470/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T02:55:20Z",
  "completed": "2019-03-13T03:13:48Z"
}

So on average, about 19 minutes to level.
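
(Each "time to level" above is just completionTime minus startedTime; a throwaway Go check, with the first run of each set hard-coded from the outputs above:)

package main

import (
    "fmt"
    "time"
)

func main() {
    parse := func(s string) time.Time {
        t, err := time.Parse(time.RFC3339, s)
        if err != nil {
            panic(err)
        }
        return t
    }
    withPR := parse("2019-03-13T15:50:10Z").Sub(parse("2019-03-13T15:39:00Z"))
    withoutPR := parse("2019-03-12T19:54:15Z").Sub(parse("2019-03-12T19:35:32Z"))
    fmt.Println(withPR, withoutPR) // 11m10s 18m43s
}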

This adds a significant speed up. 💯

/approve

@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 13, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit a4d54ef into openshift:master Mar 13, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 24, 2020
As long as the ClusterOperator is Available=True and setting the
expected versions, we should be able to move forward.  We'd softened
our install-time handling to act this way back in b0b4902
(clusteroperator: Don't block on failing during initialization,
2019-03-11, openshift#136) when the type was Failing.  The motivation there had
been install speed [1].  And a degraded operator may slow dependent
components in their own transitions.  But as long as the
operator/operand are available at all, it should not block dependent
components from transitioning.

We still have the critical ClusterOperatorDegraded waking admins up
when an operator goes Degraded=True for a while; we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 24, 2020
We have been blocking on this condition since 545c342 (api: make status
substruct on operatorstatus, 2018-10-15, openshift#31) when it was Failing.
We'd softened our install-time handling to act this way back in
b0b4902 (clusteroperator: Don't block on failing during
initialization, 2019-03-11, openshift#136), motivated by install speed [1].
And a degraded operator may slow dependent components in their own
transitions.  But as long as the operator/operand are available at
all, it should not block dependent components from transitioning, so
this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded waking admins up
when an operator goes Degraded=True for a while; we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request May 7, 2021
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource leveled?"
since way back in b0b4902 (clusteroperator: Don't block on failing
during initialization, 2019-03-11, openshift#136).  That can lead to
ClusterOperatorDown and ClusterOperatorDegraded firing during install,
as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.  For ClusterOperatorDegraded, the degraded condition
should not be particularly urgent [4], so we should be fine bumping it
to 'warning' and using 'for: 30m' or something more relaxed than the
current 10m.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
wking added a commit to wking/cluster-version-operator that referenced this pull request May 24, 2021
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.
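
A minimal sketch of the "presence of an UpdateError means fail fast" idea (illustrative only; the CVO's real payload error type and retry loop differ in detail):

package sketch

import "errors"

// UpdateError stands in for the payload error type that already carries a
// diagnosis of unhappy resource state.
type UpdateError struct {
    Reason  string
    Message string
}

func (e *UpdateError) Error() string { return e.Reason + ": " + e.Message }

// shouldRetry keeps retrying transient failures (API-server hiccups, patch
// conflicts) but bails out as soon as the error is an UpdateError, so the
// condition can be reported and the next sync cycle can start.
func shouldRetry(err error) bool {
    var ue *UpdateError
    return err != nil && !errors.As(err, &ue)
}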
wking added a commit to wking/cluster-version-operator that referenced this pull request May 25, 2021
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.

I'd tried dropping:

  backoff := st.Backoff

and passing st.Backoff directly to ExponentialBackoffWithContext, but
it turns out that Step() [1]:

  ... mutates the provided Backoff to update its Steps and Duration.

Luckily, Backoff has no pointer properties, so storing as a local
variable is sufficient to give us a fresh copy for the local
mutations.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
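
For reference, a tiny demonstration of that Step() mutation and why a by-value copy is enough (the st struct here is a stand-in, not the CVO's real sync-worker state):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    st := struct{ Backoff wait.Backoff }{
        Backoff: wait.Backoff{Duration: time.Second, Factor: 2, Steps: 4},
    }

    backoff := st.Backoff // value copy; Backoff has no pointer fields
    for backoff.Steps > 0 {
        fmt.Println(backoff.Step()) // each call decrements Steps and grows Duration
    }
    fmt.Println(st.Backoff.Steps, st.Backoff.Duration) // still 4 and 1s: the original is untouched
}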
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 8, 2021
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource leveled?"
since way back in b0b4902 (clusteroperator: Don't block on failing
during initialization, 2019-03-11, openshift#136).  That can lead to
ClusterOperatorDown and ClusterOperatorDegraded firing during install,
as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.  For ClusterOperatorDegraded, the degraded condition
should not be particularly urgent [4], so we should be fine bumping it
to 'warning' and using 'for: 30m' or something more relaxed than the
current 10m.

This commit brings back

* fb5257d
  (install/0000_90_cluster-version-operator_02_servicemonitor: Soften
  ClusterOperatorDegraded, 2021-05-06, openshift#554) and
* 92ed7f1
  (install/0000_90_cluster-version-operator_02_servicemonitor: Update
  ClusterOperatorDegraded message to 30m, 2021-05-08, openshift#556).

There are some conflicts, because I am not bringing back 90539f9
(pkg/cvo/metrics: Ignore Degraded for cluster_operator_up, 2021-04-26, openshift#550).
But that one had its own conflicts in metrics.go [5], and the
conflicts with this commit were orthogonal context issues, so moving
this back to 4.7 first won't make it much harder to bring back openshift#550
and such later on, if we decide to do that.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
[5]: openshift#550 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 25, 2024
We have been blocking on this condition since 545c342 (api: make status
substruct on operatorstatus, 2018-10-15, openshift#31) when it was Failing.
We'd softened our install-time handling to act this way back in
b0b4902 (clusteroperator: Don't block on failing during
initialization, 2019-03-11, openshift#136), motivated by install speed [1].
And a degraded operator may slow dependent components in their own
transitions.  But as long as the operator/operand are available at
all, it should not block dependent components from transitioning, so
this commit removes the Degraded=True block from the remaining modes.

We still have the warning ClusterOperatorDegraded alerting admins
when an operator goes Degraded=True for a while; we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.