
sync: Completely parallelize the initial payload #136


Merged
merged 3 commits into openshift:master from fast_init
Mar 13, 2019

Conversation

smarterclayton
Contributor

@smarterclayton smarterclayton commented Mar 11, 2019

Increase parallelism to the maximum number of tasks, increase backoff, wait longer for cluster operators, and generally try to load everything at the same time.

This results in a significant reduction in boot time and allows us to keep run-levels in the payload in the desired update order, not the desired install order.

Builds on top of #133
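
Roughly, "completely parallelize" means pushing every manifest task through one bounded worker pool at the same time instead of walking run-levels serially. A minimal sketch of that pattern, purely for illustration (this is not the CVO's actual task-graph code, and the names are made up):

package sketch

import (
    "context"
    "sync"
)

// runAll starts every task at once, bounded by maxParallelism workers.
func runAll(ctx context.Context, tasks []func(context.Context) error, maxParallelism int) error {
    sem := make(chan struct{}, maxParallelism) // limits how many tasks run concurrently
    errs := make(chan error, len(tasks))
    var wg sync.WaitGroup
    for _, task := range tasks {
        wg.Add(1)
        go func(task func(context.Context) error) {
            defer wg.Done()
            sem <- struct{}{} // acquire a worker slot
            defer func() { <-sem }()
            errs <- task(ctx)
        }(task)
    }
    wg.Wait()
    close(errs)
    for err := range errs {
        if err != nil {
            return err // surface the first failure; per-task retry/backoff omitted
        }
    }
    return nil
}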

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 11, 2019
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 11, 2019
@smarterclayton
Contributor Author

/test e2e-aws

1 similar comment
@smarterclayton
Contributor Author

/test e2e-aws

@smarterclayton smarterclayton force-pushed the fast_init branch 2 times, most recently from d3f6d5a to 2fb95a7 on March 12, 2019 03:21
@smarterclayton
Contributor Author

Console continues to take the longest: the last run was 11m total, monitoring finished at 8m30s, and console took just about 5m on its own.

@smarterclayton
Contributor Author

/test e2e-aws

1 similar comment
@smarterclayton
Contributor Author

/test e2e-aws


@smarterclayton
Contributor Author

/retest

2 similar comments
@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/test e2e-aws

@smarterclayton
Contributor Author

So I've seen one failure out of 6 runs, which is about what mainline sees. I think this is ready for review @abhinavdahiya

@smarterclayton smarterclayton changed the title WIP: sync: Completely parallelize the initial payload sync: Completely parallelize the initial payload Mar 13, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 13, 2019
Increase parallelism to the maximum number of tasks, increase backoff, wait longer
for cluster operators, and generally try to load everything at the
same time.
Initialization can be bursty because we don't have any competing
workloads.
If the operator reports level and is available, we're good to go.
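A rough sketch of the "reports level and is available" check described above, using the openshift/api ClusterOperator types (illustrative only; the function name and shape are not the CVO's exact code):

package sketch

import (
    configv1 "github.com/openshift/api/config/v1"
)

// reportsLevelAndAvailable returns true once the ClusterOperator lists every
// expected operand version and reports Available=True.
func reportsLevelAndAvailable(co *configv1.ClusterOperator, want []configv1.OperandVersion) bool {
    for _, w := range want {
        found := false
        for _, have := range co.Status.Versions {
            if have.Name == w.Name && have.Version == w.Version {
                found = true
                break
            }
        }
        if !found {
            return false // still rolling out the expected version
        }
    }
    for _, c := range co.Status.Conditions {
        if c.Type == configv1.OperatorAvailable {
            return c.Status == configv1.ConditionTrue
        }
    }
    return false // no Available condition reported yet
}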
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 13, 2019
@smarterclayton
Contributor Author

/retest

},
},
{
tasks: tasks("0000_01_x-y-z_file1", "0000_01_x-y-z_file2"),
Contributor

This is the same as fee2d06#diff-6c3be3194acbe263715006cdee7e9442R276: multiple grouped items in a single node.

@abhinavdahiya
Contributor

This looks good to me.
Trying to see how much of a speed-up we get.
From this PR:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/468/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T15:39:00Z",
  "completed": "2019-03-13T15:50:10Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/464/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T00:24:35Z",
  "completed": "2019-03-13T00:35:22Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/136/pull-ci-openshift-cluster-version-operator-master-e2e-aws/458/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-12T05:04:14Z",
  "completed": "2019-03-12T05:18:23Z"
}

On average, about 11 minutes to level.

Compared to CI runs on the installer (without this change):

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1407/pull-ci-openshift-installer-master-e2e-aws/4464/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-12T19:35:32Z",
  "completed": "2019-03-12T19:54:15Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1407/pull-ci-openshift-installer-master-e2e-aws/4475/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T15:29:54Z",
  "completed": "2019-03-13T15:48:44Z"
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1393/pull-ci-openshift-installer-master-e2e-aws/4470/artifacts/e2e-aws/clusterversion.json | jq '.items[] | select (.metadata.name == "version") | .status.history[0] | {started: .startedTime, completed: .completionTime}'
{
  "started": "2019-03-13T02:55:20Z",
  "completed": "2019-03-13T03:13:48Z"
}

So on average, about 19 minutes to level.
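
(Each "time to level" above is just completionTime minus startedTime; a throwaway Go check, with the first run of each set hard-coded from the outputs above:)

package main

import (
    "fmt"
    "time"
)

func main() {
    parse := func(s string) time.Time {
        t, err := time.Parse(time.RFC3339, s)
        if err != nil {
            panic(err)
        }
        return t
    }
    withPR := parse("2019-03-13T15:50:10Z").Sub(parse("2019-03-13T15:39:00Z"))
    withoutPR := parse("2019-03-12T19:54:15Z").Sub(parse("2019-03-12T19:35:32Z"))
    fmt.Println(withPR, withoutPR) // 11m10s 18m43s
}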

This adds a significant speed up. 💯

/approve

@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 13, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit a4d54ef into openshift:master Mar 13, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 24, 2020
As long as the ClusterOperator is Available=True and setting the
expected versions, we should be able to move forward.  We'd softened
our install-time handling to act this way back in b0b4902
(clusteroperator: Don't block on failing during initialization,
2019-03-11, openshift#136) when the type was Failing.  The motivation there had
been install speed [1].  And a degraded operator may slow dependent
components in their own transitions.  But as long as the
operator/operand are available at all, it should not block dependent
components from transitioning.

We still have the critical ClusterOperatorDegraded waking admins up
when an operator goes Degraded=True for a while; we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 24, 2020
We have been blocking on this condition since 545c342 (api: make status
substruct on operatorstatus, 2018-10-15, openshift#31) when it was Failing.
We'd softened our install-time handling to act this way back in
b0b4902 (clusteroperator: Don't block on failing during
initialization, 2019-03-11, openshift#136), motivated by install speed [1].
And a degraded operator may slow dependent components in their own
transitions.  But as long as the operator/operand are available at
all, it should not block dependent components from transitioning, so
this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded waking admins up
when an operator goes Degraded=True for a while; we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request May 7, 2021
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource leveled?"
since way back in b0b4902 (clusteroperator: Don't block on failing
during initialization, 2019-03-11, openshift#136).  That can lead to
ClusterOperatorDown and ClusterOperatorDegraded firing during install,
as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.  For ClusterOperatorDegraded, the degraded condition
should not be particularly urgent [4], so we should be fine bumping it
to 'warning' and using 'for: 30m' or something more relaxed than the
current 10m.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
wking added a commit to wking/cluster-version-operator that referenced this pull request May 24, 2021
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.
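
A minimal sketch of the "presence of an UpdateError means fail fast" idea (illustrative only; the CVO's real payload error type and retry loop differ in detail):

package sketch

import "errors"

// UpdateError stands in for the payload error type that already carries a
// diagnosis of unhappy resource state.
type UpdateError struct {
    Reason  string
    Message string
}

func (e *UpdateError) Error() string { return e.Reason + ": " + e.Message }

// shouldRetry keeps retrying transient failures (API-server hiccups, patch
// conflicts) but bails out as soon as the error is an UpdateError, so the
// condition can be reported and the next sync cycle can start.
func shouldRetry(err error) bool {
    var ue *UpdateError
    return err != nil && !errors.As(err, &ue)
}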
wking added a commit to wking/cluster-version-operator that referenced this pull request May 25, 2021
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.

I'd tried dropping:

  backoff := st.Backoff

and passing st.Backoff directly to ExponentialBackoffWithContext, but
it turns out that Step() [1]:

  ... mutates the provided Backoff to update its Steps and Duration.

Luckily, Backoff has no pointer properties, so storing as a local
variable is sufficient to give us a fresh copy for the local
mutations.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
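
For reference, a tiny demonstration of that Step() mutation and why a by-value copy is enough (the st struct here is a stand-in, not the CVO's real sync-worker state):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    st := struct{ Backoff wait.Backoff }{
        Backoff: wait.Backoff{Duration: time.Second, Factor: 2, Steps: 4},
    }

    backoff := st.Backoff // value copy; Backoff has no pointer fields
    for backoff.Steps > 0 {
        fmt.Println(backoff.Step()) // each call decrements Steps and grows Duration
    }
    fmt.Println(st.Backoff.Steps, st.Backoff.Duration) // still 4 and 1s: the original is untouched
}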
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 8, 2021
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource leveled?"
since way back in b0b4902 (clusteroperator: Don't block on failing
during initialization, 2019-03-11, openshift#136).  That can lead to
ClusterOperatorDown and ClusterOperatorDegraded firing during install,
as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.  For ClusterOperatorDegraded, the degraded condition
should not be particularly urgent [4], so we should be fine bumping it
to 'warning' and using 'for: 30m' or something more relaxed than the
current 10m.

This commit brings back

* fb5257d
  (install/0000_90_cluster-version-operator_02_servicemonitor: Soften
  ClusterOperatorDegraded, 2021-05-06, openshift#554) and
* 92ed7f1
  (install/0000_90_cluster-version-operator_02_servicemonitor: Update
  ClusterOperatorDegraded message to 30m, 2021-05-08, openshift#556).

There are some conflicts, because I am not bringing back 90539f9
(pkg/cvo/metrics: Ignore Degraded for cluster_operator_up, 2021-04-26, openshift#550).
But that one had its own conflicts in metrics.go [5], and the
conflicts with this commit were orthogonal context issues, so moving
this back to 4.7 first won't make it much harder to bring back openshift#550
and such later on, if we decide to do that.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
[5]: openshift#550 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 25, 2024
We have been blocking on this condition since 545c342 (api: make status
substruct on operatorstatus, 2018-10-15, openshift#31) when it was Failing.
We'd softened our install-time handling to act this way back in
b0b4902 (clusteroperator: Don't block on failing during
initialization, 2019-03-11, openshift#136), motivated by install speed [1].
And a degraded operator may slow dependent components in their own
transitions.  But as long as the operator/operand are available at
all, it should not block dependent components from transitioning, so
this commit removes the Degraded=True block from the remaining modes.

We still have the warning ClusterOperatorDegraded alerting admins
when an operator goes Degraded=True for a while; we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.