
Commit ab09488

timeouts proposal

SgtCoDFish and maelvls committed

Signed-off-by: Ashley Davis <[email protected]>
Co-authored-by: Maël Valais <[email protected]>

1 parent c16b3cc commit ab09488

1 file changed: design/20220614-timeouts.md (+317, -0)
# Increasing Timeouts for Issuer Operations

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Supported Versions](#supported-versions)
- [Production Readiness](#production-readiness)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)

## Release Signoff Checklist

This checklist contains actions which must be completed before a PR implementing this design can be merged.

- [ ] This design doc has been discussed and approved
- [ ] Test plan has been agreed upon and the tests implemented
- [ ] Feature gate status has been agreed upon (whether the new functionality will be placed behind a feature gate or not)
- [ ] Graduation criteria is in place if required (if the new functionality is placed behind a feature gate, how will it graduate between stages)
- [ ] User-facing documentation has been PR-ed against the release branch in [cert-manager/website]

## Summary

Users of several different issuers (including some deployments of Venafi ([5108][], [4893][]) and ACME issuers ([4452][]) such as Sectigo ([comment][4893-comment]) and ZeroSSL ([comment][website-583-comment])) have been running into timeout errors, which manifest as `context deadline exceeded` errors in logs.

These errors occur because the relevant cert-manager controllers hardcode short timeouts into their sync/reconcile functions, and some operations which must be done during a sync/reconcile can take longer than that short timeout.

This manifests during issuer creation and during certificate issuance. Put simply: for some issuers, initialisation or certificate issuance just takes longer than the maximum amount of time we currently allow.

[5108]: https://github.com/cert-manager/cert-manager/issues/5108
[4893]: https://github.com/cert-manager/cert-manager/issues/4893
[4452]: https://github.com/cert-manager/cert-manager/issues/4452
[4893-comment]: https://github.com/cert-manager/cert-manager/issues/4893#issuecomment-1115143729
[website-583-comment]: https://github.com/cert-manager/website/issues/583#issuecomment-1011664334

## Motivation

Making a change here is justified since there are users today who want to use certain standards-compliant ACME issuers but are unable to do so with cert-manager, through no fault of their own.

There's likely to be strong demand for this; in the various issues linked in the summary above, we've seen a huge amount of engagement through reactions to posts (which isn't a perfect indicator of demand, but these issues are unusually popular relative to what we normally see).

This design will largely talk about ACME, since ACME issuer users are almost certainly by far the biggest section of the current cert-manager userbase, but the principles here apply equally to the Venafi issuer, where an instance of Venafi TPP might be deployed on-prem and could be slow for any number of reasons. We should address Venafi issuers in a similar way, but in a separate piece of work.

### Goals

#### Remain Good Cluster Citizens

The justification for a short default timeout on each sync/reconcile operation is that we avoid a situation whereby we create many "hanging" threads which could overload the node on which cert-manager is running.

The simplest solution to this issue would be to remove all timeouts or to make them all very long (say, 1 hour), but that could lead to high resource use.

#### Merge Before 1.9 and Backport

This timeout issue is present in all currently supported versions of cert-manager (1.7 and 1.8 at the time of writing). Ideally we'd like to get this fix merged before we release 1.9 (currently scheduled for Jul 06, 2022) and to backport it to other supported releases, so that we'll be able to support ACME issuers like Sectigo in multiple versions.

### Non-Goals

#### Individually Configurable Timeouts for Every Operation or API Call

Having such timeouts might be attractive at some point, but for this solution we specifically want a broader approach, pending some evidence that more specific timeout configuration is something people actually want.

#### Adding New Timeouts Where We Don't Currently Have Them

While we should almost certainly have an upper-limit timeout for every single controller to prevent runaway resource usage through infinitely hanging sync operations, it would be preferable to keep the scope of this PR minimal so that it passes the simplicity threshold for being backported.

## Proposal

Note that this proposal is intentionally kept simple in terms of the scope of what we should actually change.

This is because we take an _extremely_ conservative approach to backports, where we seek to absolutely minimise the amount and complexity of changes we introduce to already released versions.

Hence this proposal will focus on a simple approach that would apply everywhere, and which we could build upon in any future releases if we so chose.

Note also that the specific numeric values for the timeouts are given and justified later, after the changes themselves are proposed; this makes the values easier to change. Until then, timeouts are given Greek letter names such as `α` and `β`.

An executive summary of the changes would be to remove a few middleware timeouts and then increase the hardcoded defaults we currently have in a few places. Really, all of the below is just saying that in long form. The justification is in the [Design Details](#design-details) section below.

### Increase Some Existing Controller Timeouts

Note that links in this section are for the `release-1.8` branch, but these changes would also apply similarly to our other currently-supported version of cert-manager (v1.7) and to the `master` branch.

We propose to increase the existing timeouts for the [issuers][] and [clusterissuers][] controllers to `α` seconds.

[issuers]: https://github.com/cert-manager/cert-manager/blob/2918f68664be98f93b8092afbbc503e42d9b7588/pkg/controller/issuers/sync.go#L45
[clusterissuers]: https://github.com/cert-manager/cert-manager/blob/2918f68664be98f93b8092afbbc503e42d9b7588/pkg/controller/clusterissuers/sync.go#L45

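For concreteness, the pattern being changed looks roughly like the sketch below. This is not cert-manager's actual code; the function and constant names are invented for illustration, and `α` is filled in with the 120-second value proposed in [Design Details](#design-details).

```go
// A minimal sketch (not cert-manager's actual code) of the per-sync timeout
// pattern used in these controllers; the names are invented for illustration.
package main

import (
	"context"
	"fmt"
	"time"
)

// α: the proposed per-sync timeout of 120 seconds (see Design Details).
const syncTimeout = 120 * time.Second

func processItem(ctx context.Context, key string) error {
	// Everything the sync does, including outbound calls to an ACME or Venafi
	// API, must finish before this deadline or fail with
	// "context deadline exceeded".
	ctx, cancel := context.WithTimeout(ctx, syncTimeout)
	defer cancel()
	return sync(ctx, key)
}

func sync(ctx context.Context, key string) error {
	// Stand-in for the real issuer setup / issuance work.
	select {
	case <-time.After(2 * time.Second): // pretend the upstream call is quick
		return nil
	case <-ctx.Done():
		return fmt.Errorf("sync %q: %w", key, ctx.Err())
	}
}

func main() {
	fmt.Println(processItem(context.Background(), "default/my-issuer"))
}
```
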
### Remove Middleware Timeouts

We propose to remove the context timeouts we currently have in our ACME logging middleware; that would look like (or be) [this unmerged commit][b1953].

The affected controllers appear to be those which have an `accounts.Getter`:

- [acmechallenges](https://github.com/cert-manager/cert-manager/blob/c16b3cca7b418ba0d0b2bf1066514b8762984517/pkg/controller/acmechallenges/controller.go#L48)
- [acmeorders](https://github.com/cert-manager/cert-manager/blob/c16b3cca7b418ba0d0b2bf1066514b8762984517/pkg/controller/acmeorders/controller.go#L50)

These timeouts have two issues. The first is that the place where they're added is unintuitive; the timeouts are added in _logging_ middleware which doesn't otherwise mention that it also introduces timeouts.

That's confusing; we might reasonably expect a timeout on writing the logs themselves (i.e. the actual operation of writing to a log), but this functionality doesn't manage that.

The second issue is that these timeouts effectively duplicate HTTP client timeouts.

HTTP client timeouts belong on the underlying HTTP client; that's where we could set more fine-grained controls such as TLS handshake, dialer and overall HTTP request timeouts. HTTP client timeouts are [desirable](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/), and we [actually already have them](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L58-L75) for our ACME clients.

It's worth noting that if we simply removed the logging middleware timeout, the HTTP client timeouts would start to take effect. They're longer than the context timeout and so are never relevant in cert-manager today, but they would become relevant without the middleware context timeout. We also propose below to tweak those HTTP timeouts in any case.

[b1953]: https://github.com/cert-manager/cert-manager/pull/5206/commits/b195382a67c5e548752407d12851b84b4d6b8e28

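To make the distinction concrete, here is a minimal sketch (with invented names; it is not the actual middleware) contrasting a context deadline applied around a wrapped call, which is the kind of timeout being removed, with a timeout that lives on the HTTP client itself, which is the kind that remains:

```go
// Sketch only: contrasts the two kinds of timeout discussed above.
// Neither function here is cert-manager's real middleware or client code.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// withDeadline imitates the middleware-style timeout being removed: it wraps
// a whole operation (e.g. an ACME account registration) in a context deadline.
func withDeadline(d time.Duration, inner func(ctx context.Context) error) func(ctx context.Context) error {
	return func(ctx context.Context) error {
		ctx, cancel := context.WithTimeout(ctx, d)
		defer cancel()
		return inner(ctx)
	}
}

// newClient shows where a timeout more naturally belongs: on the HTTP client
// itself, where it bounds each individual request (see the next section).
func newClient(overall time.Duration) *http.Client {
	return &http.Client{Timeout: overall}
}

func main() {
	op := withDeadline(10*time.Second, func(ctx context.Context) error {
		// Real code would use the client to make ACME requests here; the
		// context deadline above would cut them off regardless of the
		// client's own, longer timeout.
		_ = newClient(90 * time.Second)
		return ctx.Err()
	})
	fmt.Println(op(context.Background()))
}
```
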
### Increase ACME HTTP Timeouts

We propose to update the overall timeout for our HTTP clients for ACME requests to `β` seconds and - at least for now - **not** to make this configurable by users.

As mentioned above, we already have HTTP timeouts on the HTTP clients we build for use with ACME clients, as seen [here](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L58-L75).

The dialer and TLS handshake timeouts are set to 30 and 10 seconds respectively; both are likely fine to keep as they are, and in any case are unlikely to be a problem for people experiencing the issues detailed in [#5080](https://github.com/cert-manager/cert-manager/issues/5080).

The overall HTTP timeout (see [this blog](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/) for a diagram showing the different client timeouts) is [set to 30s](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L73), which is too short to address the issues we're trying to fix here.

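As a rough sketch of what the resulting client would look like (reconstructed here for illustration rather than copied from the linked `client.go`, with `β` filled in with the 90-second value proposed in [Design Details](#design-details)):

```go
// Rough sketch of an ACME HTTP client with the timeouts discussed above;
// reconstructed for illustration rather than copied from the linked client.go.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func newACMEHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// Dialer timeout: unchanged at 30 seconds.
			DialContext: (&net.Dialer{Timeout: 30 * time.Second}).DialContext,
			// TLS handshake timeout: unchanged at 10 seconds.
			TLSHandshakeTimeout: 10 * time.Second,
		},
		// β: the overall per-request timeout, raised from 30s to 90s so that
		// slow ACME servers (e.g. Sectigo) can respond before we give up.
		Timeout: 90 * time.Second,
	}
}

func main() {
	fmt.Printf("overall ACME request timeout: %s\n", newACMEHTTPClient().Timeout)
}
```
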
### Risks and Mitigations

The risk here is that the timeouts we have today, as currently configured, might _already_ be the last line of defence for some controllers in some clusters. By that we mean that by increasing the timeout we could pass some critical threshold where pods start to run out of resources and crash.

This could be in a cluster where someone is trying to create many slow-to-create Issuers at once, or to issue many Certificates at once from issuers which are slow, or to issue many Certificates from a single issuer which becomes overloaded and slows down.

## Design Details

Most of the above proposals involve tweaking integers, or else they have [a commit](https://github.com/cert-manager/cert-manager/pull/5206/commits/b195382a67c5e548752407d12851b84b4d6b8e28) which can be viewed. The actual PR here won't be difficult to understand - that's intentional, to minimise the complexity of a change we want to backport.

The proposed value for `β` - the ACME HTTP timeout - is 90s. That's longer than the 60s which seems to be around the upper limit of what people have reported seeing with Sectigo, which appears to be on the slower side.

The proposed value for `α` - the context timeout - is 120s. This is long enough to fit in a slow ACME request and then do other work relating to issuance.

### Justification

#### Crossplane

The above values are mostly inspired by crossplane.io, which uses context timeouts very heavily in basically all controllers (which is something that cert-manager should likely aspire to).

Here's a selection of timeouts in various crossplane controllers:

| Timeout | Reconciliation loop |
|---------|---------------------|
| 2 mins  | [definition/reconciler.go#L55](https://github.com/crossplane/crossplane/blob/60fc7df4b3c9d0f11e9ea719b63b52bec56d3853/internal/controller/apiextensions/definition/reconciler.go#L55) |
| 2 mins  | [definition/reconciler.go#L43](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/definition/reconciler.go#L43) |
| 2 mins  | [composition/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/apiextensions/composition/reconciler.go#L45) |
| 2 mins  | [binding/reconciler.go#L46](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/provider/binding/reconciler.go#L46) |
| 2 mins  | [namespace/reconciler.go#L43](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/namespace/reconciler.go#L43) |
| 2 mins  | [roles/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/provider/roles/reconciler.go#L45) |
| 2 mins  | [composite/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/dd23304b466690f89a82b57825ca8a9870767972/internal/controller/apiextensions/composite/reconciler.go#L45) |
| 1 min   | [offered/reconciler.go#L53](https://github.com/crossplane/crossplane/blob/2c9872543c5074928ba739e147b02e09cd1f3090/internal/controller/apiextensions/offered/reconciler.go#L53) |
| 1 min   | [claim/reconciler.go#L46](https://github.com/crossplane/crossplane/blob/dd23304b466690f89a82b57825ca8a9870767972/internal/controller/apiextensions/claim/reconciler.go#L46) |
| 1 min   | [manager/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/1344a86018f03fcbba6b9365c63696dcd47ebcf6/internal/controller/pkg/manager/reconciler.go#L45) |
| 1 min   | [resolver/reconciler.go#L47](https://github.com/crossplane/crossplane/blob/243f1f47ca0b17503b08ef77770f41a89defeb82/internal/controller/pkg/resolver/reconciler.go#L47) |
| 3 mins  | [revision/reconciler.go#L54](https://github.com/crossplane/crossplane/blob/1344a86018f03fcbba6b9365c63696dcd47ebcf6/internal/controller/pkg/revision/reconciler.go#L54) |

Crossplane also has an [open issue](https://github.com/crossplane/crossplane/issues/2564) relating to making this value configurable.

The idea here is "if it's good enough for crossplane, why should it not be good enough for us?"

The current cert-manager timeouts are arbitrary. Likely the crossplane timeouts are also arbitrary. We can at least have confidence that a big project with a tonne of controllers and CRDs is using longer timeouts and clearly not seeing world-ending problems, and people want to _increase_ the timeouts from that base too, as evidenced by the above open issue.

Another relevant comparison is certbot, which has a [45s](https://github.com/certbot/certbot/blob/295fc5e33a68c945d2f62e84ed8e6aaecfe93102/acme/acme/client.py#L46) limit on network requests by default and a 90s limit on polling after the fact. We go higher here because people have reported their requests taking longer than 45 seconds.

#### Kubernetes

In our investigation we couldn't find any evidence that timeouts are used in Kubernetes controllers such as the Deployments controller. That implies they don't think this topic is particularly scary, which in turn provides some comfort that picking a higher timeout isn't likely to bring about the end of the world.

#### User Reports

User reports of time taken vary slightly:

- One user reported times of around 30s: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1144561276
- Another reported similar: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1143889347
- One said "at least 60s, sometimes more": https://github.com/cert-manager/cert-manager/issues/5080#issuecomment-1137259557

Since the largest time is 'sometimes more than 60s', our lowest timeout should likely be at least 60s.

#### General Justifications

There's a risk with work like this that we focus too much on what's already present. The 10s timeout which is set for these controllers on `master` was committed with absolutely no justification we can find in the PR, commit message, design doc or anywhere else.

We should have the same amount of attachment to that 10-second timeout as there was justification given for it when it was originally added - absolutely none.

That's not to say it was a mistake or bad to add these timeouts originally - the controllers which have them are better than almost every other controller we have, just by virtue of having a timeout at all!

### Test Plan

As far as we can tell, none of this is tested currently, nor do we plan to add any new tests in the PR.

Down the road it would be great to add some tests for this, but that feels like a bigger project than this short-term bug fix.

### Graduation Criteria

There's nothing to feature gate and no graduation required.

### Upgrade / Downgrade Strategy

A user upgrading will notice no changes except that the failing ACME issuers they currently have should start working.

A user downgrading to an unsupported version (1.6 and below) could start to see timeout issues since the timeouts in unsupported releases will be lower than those in 1.7 and above.

### Supported Versions

cert-manager 1.7+.

Kubernetes version support is not relevant to this change.

## Production Readiness

### How can this feature be enabled / disabled for an existing cert-manager installation?

This will take effect by default in all versions and cannot be disabled.

### Does this feature depend on any specific services running in the cluster?

Yes, cert-manager.

### Will enabling / using this feature result in new API calls (i.e. to Kubernetes apiserver or external services)?

This change will result in fewer API calls to external services and to Kubernetes. Since issuance on slow ACME issuers will be more likely to succeed, we won't have long retry loops where we continually retry and get timed out.

Put another way: instead of 20 requests where 19 of them time out and one succeeds, we'll just have one request that succeeds.

### Will enabling / using this feature result in increasing size or count of the existing API objects?

Potentially fewer Orders and Challenges, since we'll fail less often.

### Will enabling / using this feature result in significant increase of resource usage? (CPU, RAM...)

If there's a massive queue of blocked requests there could be some increase in resource usage. It's hard to say how it would change — if anything there's likely to be a _decrease_ in total resource usage, since longer-running requests will succeed and we won't need to pay the cost of running a whole Sync function again after an ACME request times out.

## Drawbacks

Once we bump the default max timeout we can't realistically reduce the default again, since that could be perceived as backwards-incompatible.

To reduce timeouts we'd likely need to add a flag for configuring the timeouts and let cluster admins make the choice.

## Alternatives

The obvious alternative is to add a flag or multiple flags to control these timeouts. That was rejected because it's a larger change and we want to backport these changes.

Another option is to increase the hardcoded timeouts everywhere _and_ add a flag to configure this on master, so that the fix is applied on older branches but we also get the functionality of a flag. This was rejected because it's more work and we don't have any evidence that we need to add a flag. We'd also have to face the difficulty of deciding how to scope the flags (i.e. should there be a contextTimeout which applies to all controllers everywhere? Is that useful?). A sketch of what such a flag might look like is given below.

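Purely for illustration, the rejected alternative might look something like the following sketch. The flag name, default and wiring are all invented here; nothing like this is part of the actual proposal.

```go
// Hypothetical sketch of the rejected "add a flag" alternative.
// The flag name and plumbing are invented for illustration only.
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// One possible scoping: a single timeout applied to every controller sync.
	syncTimeout := flag.Duration("controller-sync-timeout", 120*time.Second,
		"Maximum time a single controller sync/reconcile may take")
	flag.Parse()

	fmt.Println("controllers would be configured with a sync timeout of", *syncTimeout)
}
```
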
