Commit 60d164e

SuppressTransientErrors sub reconciler (#621)
Avoid status flapping for errors that are only passing in nature. The SuppressTransientErrors sub reconciler will skip status updates for transient errors until the error occurs three times.

An error can be marked as non-transient with ErrDurable. Durable errors are those where, absent a change in observed state, the error continues to be produced.

SubReconcilerTestCase now includes AdditionalReconciles, which can be used to run additional reconcile requests for the same reconciler instances. This is useful for testing state stored on the reconciler itself.

Signed-off-by: Scott Andrews <[email protected]>
1 parent 4a54916 commit 60d164e

4 files changed, +732 -0 lines changed
README.md

+28
@@ -29,6 +29,7 @@ Within an existing Kubebuilder or controller-runtime project, reconcilers.io may
- [OverrideSetup](#overridesetup)
- [WithConfig](#withconfig)
- [WithFinalizer](#withfinalizer)
- [SuppressTransientErrors](#suppresstransienterrors)
- [AdmissionWebhookAdapter](#admissionwebhookadapter)
- [Testing](#testing)
- [ReconcilerTests](#reconcilertests)
@@ -700,6 +701,33 @@ func SyncExternalState() *reconcilers.SubReconciler[*resources.MyResource] {
}
```

#### SuppressTransientErrors

[`SuppressTransientErrors`](https://pkg.go.dev/reconciler.io/runtime/reconcilers#SuppressTransientErrors) prevents flapping status updates for transient errors. Status updates are allowed as normal for:
- requests that don't error
- requests that result in an [`ErrDurable`](https://pkg.go.dev/reconciler.io/runtime/reconcilers#ErrDurable) error
- requests that error repeatedly, reaching the `Threshold` (defaults to 3)

Durable errors are always candidates for status updates and reset the transient counter. Errors not explicitly marked as durable are assumed to be transient. `ErrDurable` can either be returned directly, or joined/wrapped with other errors.

Nested SubReconcilers should reflect the content of the error on the resource's status to the best of their ability. The `SuppressTransientErrors` reconciler will indicate if the status update should be skipped.

**Example:**

`SuppressTransientErrors` should be installed as close to the root of the reconciler hierarchy as possible to ensure all error sources are captured.

```go
func MyResourceReconciler(c reconcilers.Config) *reconcilers.ResourceReconciler[*resources.MyResource] {
	return &reconcilers.ResourceReconciler[*resources.MyResource]{
		// the methods use pointer receivers, so the reconciler is passed by reference
		Reconciler: &reconcilers.SuppressTransientErrors[*resources.MyResource, *resources.MyResourceList]{
			Reconciler: SomethingThatMayFail(),
		},
	}
}
```

### AdmissionWebhookAdapter

[`AdmissionWebhookAdapter`](https://pkg.go.dev/reconciler.io/runtime/reconcilers#AdmissionWebhookAdapter) allows using [SubReconciler](#subreconciler) to process [admission webhook requests](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-request-and-response). The full suite of sub-reconcilers are available, however, behavior that is [generally not accepted](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#side-effects) within a webhook is discouraged. For example, new requests against the API server are discouraged (reading from an informer is ok), mutation requests against the API Server can cause a loop with the webhook processing its own requests.

reconcilers/errors.go

+210
@@ -0,0 +1,210 @@
/*
Copyright 2025 the original author or authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package reconcilers

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"

	"github.com/go-logr/logr"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/sets"
	"reconciler.io/runtime/internal"
	rtime "reconciler.io/runtime/time"
	"reconciler.io/runtime/validation"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var (
	// ErrDurable is an error for which the reconcile request should not be retried until the
	// observed state has changed. Meaningful state about the error has been captured on the
	// status.
	ErrDurable = errors.Join(ErrQuiet, ErrHaltSubReconcilers)
)

// SuppressTransientErrors skips status updates for transient (non-ErrDurable) errors returned
// by the nested Reconciler until the error has occurred Threshold times.
type SuppressTransientErrors[Type client.Object, ListType client.ObjectList] struct {
	// Name used to identify this reconciler. Defaults to `SuppressTransientErrors`. Ideally
	// unique, but not required to be so.
	//
	// +optional
	Name string

	// ListType is the listing type for the type. For example, PodList is the list type for Pod.
	// Required when the generic type is not a struct, or is unstructured.
	//
	// +optional
	ListType ListType

	// Setup performs initialization on the manager and builder this reconciler will run with. It's
	// common to setup field indexes and watch resources.
	//
	// +optional
	Setup func(ctx context.Context, mgr ctrl.Manager, bldr *builder.Builder) error

	// Threshold is the number of non-ErrDurable error reconciles that must be encountered for a
	// specific resource generation before the status update is no longer suppressed. Defaults
	// to 3.
	Threshold uint8

	// Reconciler is the nested reconciler whose transient errors are suppressed.
	Reconciler SubReconciler[Type]

	lazyInit     sync.Once
	m            sync.Mutex
	lastPurge    time.Time
	errorCounter map[types.UID]transientErrorCounter
}

func (r *SuppressTransientErrors[T, LT]) SetupWithManager(ctx context.Context, mgr ctrl.Manager, bldr *builder.Builder) error {
	r.init()

	log := logr.FromContextOrDiscard(ctx).
		WithName(r.Name)
	ctx = logr.NewContext(ctx, log)

	if err := r.Validate(ctx); err != nil {
		return err
	}
	if err := r.Reconciler.SetupWithManager(ctx, mgr, bldr); err != nil {
		return err
	}
	if r.Setup == nil {
		return nil
	}
	return r.Setup(ctx, mgr, bldr)
}

func (r *SuppressTransientErrors[T, LT]) init() {
	r.lazyInit.Do(func() {
		if r.Name == "" {
			r.Name = "SuppressTransientErrors"
		}
		if internal.IsNil(r.ListType) {
			var nilLT LT
			r.ListType = newEmpty(nilLT).(LT)
		}
		if r.Threshold == 0 {
			r.Threshold = 3
		}
		r.errorCounter = map[types.UID]transientErrorCounter{}
	})
}

func (r *SuppressTransientErrors[T, LT]) checkStaleCounters(ctx context.Context) {
	now := rtime.RetrieveNow(ctx)
	log := logr.FromContextOrDiscard(ctx)

	r.m.Lock()
	defer r.m.Unlock()

	if r.lastPurge.IsZero() {
		r.lastPurge = now
		return
	}
	if r.lastPurge.Add(24 * time.Hour).After(now) {
		return
	}

	log.Info("purging stale resource counters")

	c := RetrieveConfigOrDie(ctx)
	list := r.ListType.DeepCopyObject().(LT)
	if err := c.List(ctx, list); err != nil {
		log.Error(err, "purge failed to list resources")
		return
	}

	validIds := sets.New[types.UID]()
	for _, item := range extractItems[T](list) {
		validIds.Insert(item.GetUID())
	}

	counterIds := sets.New[types.UID]()
	for uid := range r.errorCounter {
		counterIds.Insert(uid)
	}

	for _, uid := range counterIds.Difference(validIds).UnsortedList() {
		log.V(2).Info("purging counter", "id", uid)
		delete(r.errorCounter, uid)
	}

	r.lastPurge = now
}

func (r *SuppressTransientErrors[T, LT]) Validate(ctx context.Context) error {
	r.init()

	// validate Reconciler
	if r.Reconciler == nil {
		return fmt.Errorf("SuppressTransientErrors %q must implement Reconciler", r.Name)
	}
	if validation.IsRecursive(ctx) {
		if v, ok := r.Reconciler.(validation.Validator); ok {
			if err := v.Validate(ctx); err != nil {
				return fmt.Errorf("SuppressTransientErrors %q must have a valid Reconciler: %w", r.Name, err)
			}
		}
	}

	return nil
}

func (r *SuppressTransientErrors[T, LT]) Reconcile(ctx context.Context, resource T) (Result, error) {
	log := logr.FromContextOrDiscard(ctx).
		WithName(r.Name)
	ctx = logr.NewContext(ctx, log)

	// deferred first so it runs last, after the lock taken below has been released
	defer r.checkStaleCounters(ctx)

	result, err := r.Reconciler.Reconcile(ctx, resource)

	// guard the counter map: a given resource is never processed concurrently, but different
	// resources may be, and Go maps are not safe for concurrent writes even to distinct keys
	r.m.Lock()
	defer r.m.Unlock()

	if err == nil || errors.Is(err, ErrDurable) {
		// success and durable errors reset the transient counter
		delete(r.errorCounter, resource.GetUID())
		return result, err
	}

	counter, ok := r.errorCounter[resource.GetUID()]
	if !ok || counter.Generation != resource.GetGeneration() {
		// first transient error seen for this resource generation
		counter = transientErrorCounter{
			Generation: resource.GetGeneration(),
			Count:      0,
		}
	}

	// check overflow before incrementing
	if counter.Count != uint8(255) {
		counter.Count = counter.Count + 1
		r.errorCounter[resource.GetUID()] = counter
	}

	if counter.Count < r.Threshold {
		// suppress status update
		return result, errors.Join(err, ErrSkipStatusUpdate, ErrQuiet)
	}

	return result, err
}

type transientErrorCounter struct {
	Generation int64
	Count      uint8
}
