
Bug 1866480: pkg/cvo/status: Raise Operator leveling grace-period to 20 minutes #427

Merged
8 changes: 4 additions & 4 deletions docs/user/reconciliation.md
@@ -93,22 +93,22 @@ So the graph nodes are all parallelized with the by-number ordering flattened out

For the usual reconciliation loop (neither an upgrade between releases nor a fresh install), the flattened graph is also randomly permuted to avoid hanging on ordering bugs.

## Synchronizing the graph
## Reconciling the graph
Member

Suggested change
## Reconciling the graph
## Reconciling the cluster version

nit: "graph" sounds a bit ambiguous to me - I've immediately though of "updates graph" instead "manifests graph"

Member Author

This is a backport PR. Drop an /lgtm, and then PR master with your change to "manifest graph", which sounds good to me.


The cluster-version operator spawns worker goroutines that walk the graph, pushing manifests in their queue.
For each manifest in the node, the worker synchronizes the cluster with the manifest using a resource builder.
For each manifest in the node, the worker reconciles the cluster with the manifest using a resource builder.
On error (or timeout), the worker abandons the manifest, graph node, and any dependencies of that graph node.
On success, the worker proceeds to the next manifest in the graph node.

## Resource builders

Resource builders synchronize the cluster with a manifest from the release image.
Resource builders reconcile a cluster object with a manifest from the release image.
The general approach is to generate a merged manifest combining critical spec properties from the release-image manifest with data from a preexisting in-cluster object, if any.
If the merged manifest differs from the in-cluster object, the merged manifest is pushed back into the cluster.

Some types have additional logic, as described in the following subsections.
Note that this logic only applies to manifests included in the release image itself.
For example, only [ClusterOperator](../dev/clusteroperator.md) from the release image will have the blocking logic described [below](#clusteroperator); if an admin or secondary operator pushed a ClusterOperator object, it would not impact the cluster-version operator's graph synchronization.
For example, only [ClusterOperator](../dev/clusteroperator.md) from the release image will have the blocking logic described [below](#clusteroperator); if an admin or secondary operator pushed a ClusterOperator object, it would not impact the cluster-version operator's graph reconciliation.

### ClusterOperator

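A minimal, self-contained sketch of the merge-then-apply pattern the resource-builder section above describes; the `object` type and the `merge`/`reconcile` helpers are illustrative stand-ins, not the CVO's actual resource-builder API:

```go
package main

import (
	"fmt"
	"reflect"
)

// object is a stand-in for an in-cluster resource; the real builders work
// with typed or unstructured Kubernetes objects.
type object struct {
	Name string
	Spec map[string]string
}

// merge builds a merged manifest: critical spec properties from the
// release-image manifest win, while unrelated data on the preexisting
// in-cluster object (if any) is preserved.
func merge(manifest object, existing *object) object {
	merged := object{Name: manifest.Name, Spec: map[string]string{}}
	if existing != nil {
		for k, v := range existing.Spec {
			merged.Spec[k] = v
		}
	}
	for k, v := range manifest.Spec {
		merged.Spec[k] = v // manifest-owned properties always win
	}
	return merged
}

// reconcile pushes the merged manifest back into the cluster only when it
// differs from the in-cluster object.
func reconcile(manifest object, existing *object, apply func(object) error) error {
	merged := merge(manifest, existing)
	if existing != nil && reflect.DeepEqual(merged, *existing) {
		return nil // already reconciled; nothing to push
	}
	return apply(merged)
}

func main() {
	existing := object{Name: "example", Spec: map[string]string{"replicas": "1", "note": "admin-added"}}
	manifest := object{Name: "example", Spec: map[string]string{"replicas": "3"}}
	_ = reconcile(manifest, &existing, func(o object) error {
		fmt.Printf("applying %s: %v\n", o.Name, o.Spec)
		return nil
	})
}
```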
22 changes: 22 additions & 0 deletions docs/user/status.md
@@ -3,6 +3,27 @@
[The ClusterVersion object](../dev/clusterversion.md) sets `conditions` describing the state of the cluster-version operator (CVO).
This document describes those conditions and, where appropriate, suggests possible mitigations.

## Failing

When `Failing` is True, the CVO is failing to reconcile the cluster with the desired release image.
In all cases, the impact on the cluster will be that dependent nodes in [the manifest graph](reconciliation.md#manifest-graph) may not be [reconciled](reconciliation.md#reconciling-the-graph).
Note that the graph [may be flattened](reconciliation.md#manifest-graph), in which case there are no dependent nodes.

Most reconciliation errors will result in `Failing=True`, although [`ClusterOperatorNotAvailable`](#clusteroperatornotavailable) has special handling.

### NoDesiredImage

The CVO has not been given a release image to reconcile.

If this happens it is a CVO coding error, because clearing [`desiredUpdate`][api-desired-update] should return you to the current CVO's release image.

### ClusterOperatorNotAvailable

`ClusterOperatorNotAvailable` (or the consolidated `ClusterOperatorsNotAvailable`) is set when the CVO fails to retrieve the ClusterOperator from the cluster or when the retrieved ClusterOperator does not satisfy [the reconciliation conditions](reconciliation.md#clusteroperator).

Unlike most manifest-reconciliation failures, this error does not immediately result in `Failing=True`.
Under some conditions during installs and updates, the CVO will instead report `Progressing=True` and give the operator up to twenty minutes to level before reporting `Failing=True`.

## RetrievedUpdates

When `RetrievedUpdates` is `True`, the CVO is successfully retrieving updates, which is good.
@@ -107,5 +128,6 @@ If this error occurs because you forced an update to a release that is not in an
If this happens it is a CVO coding error.
There is no mitigation short of updating to a new release image with a fixed CVO.

[api-desired-update]: https://github.com/openshift/api/blob/34f54f12813aaed8822bb5bc56e97cbbfa92171d/config/v1/types_cluster_version.go#L40-L54
[channels]: https://docs.openshift.com/container-platform/4.3/updating/updating-cluster-between-minor.html#understanding-upgrade-channels_updating-cluster-between-minor
[Cincinnati]: https://github.com/openshift/cincinnati/blob/master/docs/design/openshift.md
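The grace-period behavior the new `docs/user/status.md` text describes can be sketched as follows; the `updateError` type and `withinGracePeriod` helper are illustrative stand-ins for the CVO's `payload.UpdateError` handling, while the reason strings and the 20-minute/one-hour windows come from the docs above and the `pkg/cvo/status.go` change below:

```go
package main

import (
	"fmt"
	"time"
)

// updateError is a stand-in for the CVO's payload.UpdateError; only the
// Reason and Message matter for this sketch.
type updateError struct {
	Reason  string
	Message string
}

func (e *updateError) Error() string { return e.Message }

// withinGracePeriod mirrors the documented behavior: an unavailable
// ClusterOperator is reported as Progressing rather than Failing while the
// operator has made progress within the last 20 minutes and the update
// started less than an hour ago. The decision shown here is a simplified
// assumption, not the full CVO logic.
func withinGracePeriod(err error, lastProgress, updateStart, now time.Time) bool {
	uErr, ok := err.(*updateError)
	if !ok {
		return false // untyped errors always surface as Failing
	}
	if uErr.Reason != "ClusterOperatorNotAvailable" && uErr.Reason != "ClusterOperatorsNotAvailable" {
		return false
	}
	return now.Sub(lastProgress) <= 20*time.Minute && now.Sub(updateStart) <= time.Hour
}

func main() {
	now := time.Now()
	err := &updateError{Reason: "ClusterOperatorNotAvailable", Message: "Cluster operator monitoring is not available"}

	// 15 minutes since last progress: still reported as Progressing.
	fmt.Println(withinGracePeriod(err, now.Add(-15*time.Minute), now.Add(-30*time.Minute), now)) // true
	// 25 minutes since last progress: grace period exceeded, Failing=True.
	fmt.Println(withinGracePeriod(err, now.Add(-25*time.Minute), now.Add(-30*time.Minute), now)) // false
}
```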
5 changes: 4 additions & 1 deletion pkg/cvo/cvo.go
@@ -472,7 +472,10 @@ func (optr *Operator) sync(key string) error {
// handle the case of a misconfigured CVO by doing nothing
if len(desired.Image) == 0 {
return optr.syncStatus(original, config, &SyncWorkerStatus{
Failure: fmt.Errorf("No configured operator version, unable to update cluster"),
Failure: &payload.UpdateError{
Reason: "NoDesiredImage",
Message: "No configured operator version, unable to update cluster",
},
}, errs)
}

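A hedged sketch of why the hunk above wraps the failure in a `payload.UpdateError`: a typed error carries a machine-readable `Reason` that status reporting can copy into the resulting conditions, which is what the updated expectations for `NoDesiredImage` in `cvo_scenarios_test.go` below check. The `updateError`, `condition`, and `failingCondition` names here are illustrative, not the CVO's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// updateError stands in for payload.UpdateError: an error that carries a
// machine-readable Reason alongside its Message.
type updateError struct {
	Reason  string
	Message string
}

func (e *updateError) Error() string { return e.Message }

// condition is a simplified ClusterOperatorStatusCondition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// failingCondition surfaces a sync failure as a Failing=True condition,
// copying the Reason when the error is a typed updateError.
func failingCondition(failure error) condition {
	c := condition{Type: "Failing", Status: "True", Message: failure.Error()}
	var uErr *updateError
	if errors.As(failure, &uErr) {
		c.Reason = uErr.Reason
	}
	return c
}

func main() {
	failure := &updateError{
		Reason:  "NoDesiredImage",
		Message: "No configured operator version, unable to update cluster",
	}
	fmt.Printf("%+v\n", failingCondition(failure))
}
```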
12 changes: 6 additions & 6 deletions pkg/cvo/cvo_scenarios_test.go
@@ -165,8 +165,8 @@ func TestCVO_StartupAndSync(t *testing.T) {
Conditions: []configv1.ClusterOperatorStatusCondition{
{Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse},
// report back to the user that we don't have enough info to proceed
{Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"},
{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply <unknown>: an error occurred"},
{Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"},
{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply <unknown>: an unknown error has occurred: NoDesiredImage"},
{Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse},
},
},
@@ -436,8 +436,8 @@ func TestCVO_StartupAndSyncUnverifiedPayload(t *testing.T) {
Conditions: []configv1.ClusterOperatorStatusCondition{
{Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse},
// report back to the user that we don't have enough info to proceed
{Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"},
{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply <unknown>: an error occurred"},
{Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"},
{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply <unknown>: an unknown error has occurred: NoDesiredImage"},
{Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse},
},
},
@@ -697,8 +697,8 @@ func TestCVO_StartupAndSyncPreconditionFailing(t *testing.T) {
Conditions: []configv1.ClusterOperatorStatusCondition{
{Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse},
// report back to the user that we don't have enough info to proceed
{Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"},
{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply <unknown>: an error occurred"},
{Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"},
{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply <unknown>: an unknown error has occurred: NoDesiredImage"},
{Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse},
},
},
4 changes: 2 additions & 2 deletions pkg/cvo/status.go
@@ -331,13 +331,13 @@ func (optr *Operator) syncStatus(original, config *configv1.ClusterVersion, stat

// convertErrorToProgressing returns true if the provided status indicates a failure condition can be interpreted as
// still making internal progress. The general error we try to suppress is an operator or operators still being
// unavailable AND the general payload task making progress towards its goal. An operator is given 10 minutes since
// unavailable AND the general payload task making progress towards its goal. An operator is given 20 minutes since
// its last update to go ready, or an hour has elapsed since the update began, before the condition is ignored.
func convertErrorToProgressing(history []configv1.UpdateHistory, now time.Time, status *SyncWorkerStatus) (reason string, message string, ok bool) {
if len(history) == 0 || status.Failure == nil || status.Reconciling || status.LastProgress.IsZero() {
return "", "", false
}
if now.Sub(status.LastProgress) > 10*time.Minute || now.Sub(history[0].StartedTime.Time) > time.Hour {
if now.Sub(status.LastProgress) > 20*time.Minute || now.Sub(history[0].StartedTime.Time) > time.Hour {
return "", "", false
}
uErr, ok := status.Failure.(*payload.UpdateError)