# Increasing Timeouts for Issuer Operations

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Supported Versions](#supported-versions)
- [Production Readiness](#production-readiness)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)

## Release Signoff Checklist

This checklist contains actions which must be completed before a PR implementing this design can be merged.

- [ ] This design doc has been discussed and approved
- [ ] Test plan has been agreed upon and the tests implemented
- [ ] Feature gate status has been agreed upon (whether the new functionality will be placed behind a feature gate or not)
- [ ] Graduation criteria is in place if required (if the new functionality is placed behind a feature gate, how will it graduate between stages)
- [ ] User-facing documentation has been PR-ed against the release branch in [cert-manager/website]

## Summary

Users of several different issuers have been running into timeout errors, which manifest as `context deadline exceeded`
errors in logs. Affected issuers include some deployments of Venafi ([5108][], [4893][]) and ACME issuers ([4452][])
such as Sectigo ([comment][4893-comment]) and ZeroSSL ([comment][website-583-comment]).

These errors occur because the relevant cert-manager controllers hardcode short timeouts into their sync/reconcile
functions, and some operations which must be done during a sync/reconcile can take longer than that short timeout.

This manifests during issuer creation and during certificate issuance. Put simply: for some issuers, initialisation or
certificate issuance just takes longer than the maximum amount of time we currently allow.

[5108]: https://github.com/cert-manager/cert-manager/issues/5108
[4893]: https://github.com/cert-manager/cert-manager/issues/4893
[4452]: https://github.com/cert-manager/cert-manager/issues/4452
[4893-comment]: https://github.com/cert-manager/cert-manager/issues/4893#issuecomment-1115143729
[website-583-comment]: https://github.com/cert-manager/website/issues/583#issuecomment-1011664334

## Motivation

Making a change here is justified since there are users today who want to use certain standards-compliant ACME issuers
but who are unable to do so with cert-manager, through no fault of their own.

There's likely to be strong demand for this; in the various issues linked in the summary above, we've seen a huge amount
of engagement through reactions to posts (which isn't a perfect indicator of demand but these issues are unusually
popular relative to what we normally see).

This design will largely talk about ACME since ACME issuer users are almost certainly by far the biggest section of the
current cert-manager userbase, but the principles here apply equally to the Venafi issuer, where an instance of
Venafi TPP might be deployed on-prem and could be slow for any number of reasons. We should address Venafi issuers in
a similar way, but in a separate piece of work.

### Goals

#### Remain Good Cluster Citizens

The justification for a short default timeout on each sync/reconcile operation is that we avoid a situation whereby we
create many "hanging" threads which could overload the node on which cert-manager is running.

The simplest solution to this issue would be to remove all timeouts or to make them all very long (say, 1 hour),
but that could lead to high resource use.

#### Merge Before 1.9 and Backport

This timeout issue is present in all currently supported versions of cert-manager (1.7 and 1.8 at the time of writing).
Ideally we'd like to get this fix merged before we release 1.9 (currently scheduled for Jul 06, 2022) and to backport
to other supported releases, so that we'll be able to support ACME issuers like Sectigo in multiple versions.

### Non-Goals

#### Individually Configurable Timeouts for Every Operation or API Call

Having such timeouts might be attractive at some point, but for now we prefer a broader approach, pending
evidence that more specific timeout configuration is something people actually want.

#### Adding New Timeouts Where We Don't Currently Have Them

While we should almost certainly have an upper-limit timeout for every single controller to prevent runaway resource
usage through infinitely hanging sync operations, it would be preferable to keep the scope of this PR minimal so that
it'll pass the threshold for simplicity to be backported.

## Proposal

Note that this proposal is intentionally kept simple in terms of the scope of what we should actually change.

This is because we take an _extremely_ conservative approach to backports where we seek to absolutely minimise the
amount and complexity of changes we introduce to already released versions.

Hence this proposal will focus on a simple approach that would apply everywhere, and which we could build upon in any
future releases if we so chose.

Note also that the specific numeric values for the timeouts are given and justified later, after the changes are
proposed; this makes the timeouts easier to change. Until then, timeouts are referred to by Greek letter names such as
`α` and `β`.

An executive summary of the changes would be to remove a few middleware timeouts and then increase the hardcoded
defaults we currently have in a few places. Really all of the below is just saying that in long form. The justification
is in the [Design Details](#design-details) section below.

### Increase Some Existing Controller Timeouts

Note that links in this section are for the `release-1.8` branch but these changes would also apply similarly to our
other currently-supported version of cert-manager (v1.7) and also for the `master` branch.

We propose to increase the existing timeouts for the [issuers][] and [clusterissuers][] controllers to
`α` seconds.

[issuers]: https://github.com/cert-manager/cert-manager/blob/2918f68664be98f93b8092afbbc503e42d9b7588/pkg/controller/issuers/sync.go#L45
[clusterissuers]: https://github.com/cert-manager/cert-manager/blob/2918f68664be98f93b8092afbbc503e42d9b7588/pkg/controller/clusterissuers/sync.go#L45

### Remove Middleware Timeouts

We propose to remove the context timeouts we currently have in our ACME logging middleware; that would look like
(or be) [this unmerged commit][b1953].

The affected controllers appear to be those which have an `accounts.Getter`:

- [acmechallenges](https://github.com/cert-manager/cert-manager/blob/c16b3cca7b418ba0d0b2bf1066514b8762984517/pkg/controller/acmechallenges/controller.go#L48)
- [acmeorders](https://github.com/cert-manager/cert-manager/blob/c16b3cca7b418ba0d0b2bf1066514b8762984517/pkg/controller/acmeorders/controller.go#L50)

These timeouts have two issues. One is that the location they're added is unintuitive; the timeouts are added in
_logging_ middleware which doesn't otherwise mention that it also introduces timeouts.

That's confusing; we might reasonably expect a timeout on writing the logs themselves (i.e. the actual operation of
writing to a log) but this functionality doesn't manage that.

The second issue is that these timeouts effectively duplicate HTTP client timeouts.

HTTP client timeouts belong on the underlying HTTP client; that's where we could set more fine-grained controls such as
TLS handshake, dialer and overall HTTP request timeouts. HTTP client timeouts are [desirable](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/),
and we [actually already have them](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L58-L75)
for our ACME clients.

It's worth noting that if we simply removed the logging middleware timeout, the HTTP client timeouts would start to
take effect. They're longer than the context timeout and so wouldn't ever be relevant on cert-manager today, but would
be relevant without the middleware context timeout. We also propose below to tweak those HTTP timeouts in any case.

[b1953]: https://github.com/cert-manager/cert-manager/pull/5206/commits/b195382a67c5e548752407d12851b84b4d6b8e28

### Increase ACME HTTP Timeouts

We propose to update the overall timeout for our HTTP clients for ACME requests to `β` seconds and - at least for now -
**not** to make this configurable by users.

As mentioned above, we already have HTTP timeouts on the HTTP clients we build for use with ACME clients, as seen
[here](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L58-L75).

The dialer and TLS handshake timeouts are set to 30 and 10 seconds respectively. Both are likely fine to keep as they
are, and in any case are unlikely to be a problem for people experiencing the issues detailed in [#5080](https://github.com/cert-manager/cert-manager/issues/5080).

The overall HTTP timeout (see [this blog](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/)
for a diagram showing the different client timeouts) is [set to 30s](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L73),
which is too short to address the issues we're trying to fix here.

### Risks and Mitigations

The risk here is that the timeouts we have already, as currently configured, might _already_ be the last line of defence
for some controllers in some clusters. By that we mean that by increasing the timeout we could pass some critical
threshold where pods start to run out of resources and crash.

This could be in a cluster where someone is trying to create many slow-to-create Issuers at once, or to issue many
Certificates at once from issuers which are slow, or to issue many Certificates from a single issuer which becomes
overloaded and slows down.

## Design Details

Most of the above proposals involve tweaking integers or else they have [a commit](https://github.com/cert-manager/cert-manager/pull/5206/commits/b195382a67c5e548752407d12851b84b4d6b8e28)
which can be viewed. The actual PR here won't be difficult to understand - that's intentional, to minimise complexity
of a change we want to backport.

The proposed value for `β` - the ACME HTTP timeout - is 90s. That's longer than the 60s which seems to be around the
upper limit of what people have reported seeing with Sectigo, which seems to be on the slower side.

The proposed value for `α` - the context timeout - is 120s. This is long enough to fit in a slow ACME request and then
do other work relating to issuance.

### Justification

#### Crossplane

The above values are mostly inspired by crossplane.io, which uses context timeouts very heavily in basically all
controllers (which is something that cert-manager should likely aspire to).

Here's a selection of timeouts in various crossplane controllers:

| Timeout | Reconciliation loop |
|---------|---------------------|
| 2 mins  | [definition/reconciler.go#L55](https://github.com/crossplane/crossplane/blob/60fc7df4b3c9d0f11e9ea719b63b52bec56d3853/internal/controller/apiextensions/definition/reconciler.go#L55) |
| 2 mins  | [definition/reconciler.go#L43](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/definition/reconciler.go#L43) |
| 2 mins  | [composition/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/apiextensions/composition/reconciler.go#L45) |
| 2 mins  | [binding/reconciler.go#L46](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/provider/binding/reconciler.go#L46) |
| 2 mins  | [namespace/reconciler.go#L43](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/namespace/reconciler.go#L43) |
| 2 mins  | [roles/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/provider/roles/reconciler.go#L45) |
| 2 mins  | [composite/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/dd23304b466690f89a82b57825ca8a9870767972/internal/controller/apiextensions/composite/reconciler.go#L45) |
| 1 min   | [offered/reconciler.go#L53](https://github.com/crossplane/crossplane/blob/2c9872543c5074928ba739e147b02e09cd1f3090/internal/controller/apiextensions/offered/reconciler.go#L53) |
| 1 min   | [claim/reconciler.go#L46](https://github.com/crossplane/crossplane/blob/dd23304b466690f89a82b57825ca8a9870767972/internal/controller/apiextensions/claim/reconciler.go#L46) |
| 1 min   | [manager/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/1344a86018f03fcbba6b9365c63696dcd47ebcf6/internal/controller/pkg/manager/reconciler.go#L45) |
| 1 min   | [resolver/reconciler.go#L47](https://github.com/crossplane/crossplane/blob/243f1f47ca0b17503b08ef77770f41a89defeb82/internal/controller/pkg/resolver/reconciler.go#L47) |
| 3 mins  | [revision/reconciler.go#L54](https://github.com/crossplane/crossplane/blob/1344a86018f03fcbba6b9365c63696dcd47ebcf6/internal/controller/pkg/revision/reconciler.go#L54) |

Crossplane also has an [open issue](https://github.com/crossplane/crossplane/issues/2564) relating to making this value
configurable.

The idea here is "if it's good enough for crossplane, why should it not be good enough for us?"

The current cert-manager timeouts are arbitrary. Likely the crossplane timeouts are also arbitrary. We can at least
have confidence that a big project with a tonne of controllers and CRDs is using longer timeouts and clearly not seeing
world-ending problems, and people want to _increase_ the timeouts from that base too, as evidenced by the above open
issue.

Another relevant comparison is certbot, which by default has a [45s](https://github.com/certbot/certbot/blob/295fc5e33a68c945d2f62e84ed8e6aaecfe93102/acme/acme/client.py#L46)
limit on network requests and a 90s limit on polling after the fact. We go higher here because people have
reported their requests taking longer than 45 seconds.

#### Kubernetes

In investigation we couldn't find any evidence that there are timeouts used in Kubernetes controllers such as the Deployments
controller. That implies they don't think this topic is particularly scary, which in turn provides some comfort that picking
a higher timeout isn't likely to bring about the end of the world.

#### User Reports

User reports of time taken vary slightly:

- One user reported times of around 30s: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1144561276
- Another reported similar: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1143889347
- One said "at least 60s, sometimes more": https://github.com/cert-manager/cert-manager/issues/5080#issuecomment-1137259557

Since the longest reported time is "sometimes more than 60s", our lowest timeout should comfortably exceed 60s.

#### General Justifications

There's a risk with work like this that we can focus too much on what's already present. The 10s timeout which is set for these
controllers on `master` was committed with absolutely no justification we can find in the PR, commit message, design doc or anywhere else.

We should have the same amount of attachment to that 10-second timeout as the justification given when it was originally added: absolutely none.

That's not to say it was a mistake or bad to add these timeouts originally - the controllers which have them are better than almost every other controller
we have just by virtue of having a timeout at all!

### Test Plan

As far as we can tell, none of this is tested currently, nor would we plan to add any new tests in the PR.

Down the road it would be great to add some tests for this stuff but that feels like a bigger project than this short
term bug fix.

### Graduation Criteria

There's nothing to feature gate and no graduation required.

### Upgrade / Downgrade Strategy

A user upgrading will notice no changes except that the failing ACME issuers they currently have should start working.

A user downgrading to an unsupported version (1.6 and below) could start to see timeout issues since the timeouts in
unsupported releases will be lower than those in 1.7 and above.

### Supported Versions

cert-manager 1.7+.

The Kubernetes version is irrelevant to this change.

## Production Readiness

### How can this feature be enabled / disabled for an existing cert-manager installation?

This will take effect by default in all versions and won't be able to be disabled.

### Does this feature depend on any specific services running in the cluster?

Yes, cert-manager

### Will enabling / using this feature result in new API calls (i.e to Kubernetes apiserver or external services)?

This change will result in fewer API calls to external services and to Kubernetes. Since issuance on slow ACME issuers
will be more likely to succeed, we won't have long retry loops where we'll continually retry and get timed out.

Put another way: instead of 20 requests where 19 of them time out and one succeeds, we'll just have one request that
succeeds.

### Will enabling / using this feature result in increasing size or count of the existing API objects?

Potentially fewer Orders and Challenges, since we'll fail less often.

### Will enabling / using this feature result in significant increase of resource usage? (CPU, RAM...)

If there's a massive queue of blocked requests there could be some increase in resource usage. Hard to say how it would
change - if anything there's likely to be a _decrease_ in total resource usage since longer-running requests will
succeed and we won't need to pay the cost of running a whole Sync function again after an ACME request times out.

## Drawbacks

Once we bump the default max timeout we can't realistically reduce the default again, since that could be perceived as
backwards-incompatible.

To reduce timeouts we'd likely need to add a flag for configuring the timeouts and let cluster admins make the choice.

## Alternatives

The obvious alternative is to add a flag or multiple flags to control these timeouts. That was rejected because it's a
larger change and we want to backport these changes.

Another option is to increase the hardcoded timeouts everywhere _and_ add a flag to configure this on master, so that
the fix is applied on older branches but we have the functionality of a flag. This was rejected because it's more work
and we don't have any evidence that we need to add a flag. We'd also have to face the difficulty of deciding how to
scope the flags (i.e. should there be a contextTimeout which applies to all controllers everywhere? Is that useful?)