Skip to content
This repository was archived by the owner on Jul 11, 2023. It is now read-only.

feat(cert): use intent to set issuers during root cert rotation #5201

Merged
merged 9 commits into from
Nov 7, 2022

Conversation

jaellio
Copy link
Contributor

@jaellio jaellio commented Oct 13, 2022

Description:

The PR includes the changes required to support an MVP of manual root certificate rotation. With these changes, a user will be able to manually create and update the intent of the MRCs and rotate the root certificate without downtime. OSM utilizes a level-triggered approach to bring the state of the mesh to match the MRC intent.

There are 3 possible values for intent. Each describes the desired role of the provider and certificate settings specified in the MRC:

  • active: the certificate provider issues and signs certificates and the ca_bundle for that provider is included in the validation context
  • passive: the certificate provider is not used to issue or sign certificates and the ca_bundle for the provider is included in the validation context
  • inactive: the certificate provider is not used to issue or sign certificates and it is not included in the validation context

To rotate the root certificate or update the certificate provider, the user must perform the following actions:
There is an existing MRC, MRC1 and a new MRC will be created, MRC2

  1. create a new MRC with inactive intent -> MRC1.intent = active, MRC2.intent = inactive
  2. update the intent of the new MRC to passive. Wait to allow the certificates to be rotated -> MRC1.intent = active, MRC2.intent = passive
  3. update the intent of the new MRC to active -> MRC1.intent = active, MRC2.intent = active
  4. update the intent of the old MRC to passive. Wait to allow the certificates to be rotated -> MRC1.intent = passive, MRC2.intent = active
  5. update the intent of the old MRC to inactive -> MRC1.intent = inactive, MRC2.intent = active

Note: All combinations of intent are deterministic except for active and active. The order in which the MRCs are listed determines which will be the signingIssuer or the validatingIssuer.

Initial proposal for root certificate rotation simplification: https://docs.google.com/document/d/1bh3Yfn5DXWjmzdKa4XHlk-9KepUNb_EgDDrzd1G_deY/edit?usp=sharing

Resolves #5198
Part of #4998

Testing done:

  • e2es testing rotation with tresor, cert-manager, and vault
  • unit tests

Affected area:

Functional Area
New Functionality [ x ]
Certificate Management [ x ]

Please answer the following questions with yes/no.

  1. Does this change contain code from or inspired by another project? No

    • Did you notify the maintainers and provide attribution?
  2. Is this a breaking change? No

  3. Has documentation corresponding to this change been updated in the osm-docs repo (if applicable)? No

@jaellio jaellio added the wip Work-in-Progress label Oct 13, 2022
@jsturtevant
Copy link
Contributor

I was able to rotate a new certificate in and remove the old one 🚀

@jsturtevant
Copy link
Contributor

one note:

k get meshrootcertificates.config.openservicemesh.io    
NAME                                          STATE
osm-mesh-root-certificate-2  

maybe we can make that

k get meshrootcertificates.config.openservicemesh.io
NAME                                                          Intent
osm-mesh-root-certificate-2                      active (or passive or etc)

@jsturtevant
Copy link
Contributor

After playing with it we landed on the following pattern:

Step MRC1 old MRC2 new signer validator
1 active mcr1 mcr1
2 active passive mcr1 mcr2
3 active active mcr2 or mrc1 mcr1 or mrc2
4 passive active mcr2 mcr1
5 inactive active mcr2 mcr2
5 active mcr2 mcr2

name: State
- description: Intent for the MeshRootCertificate
jsonPath: .spec.intent
name: Intent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I have been playing around with the MRC for the e2e tests and been thinking this field name might be better as Role. for example:

"Moving MRC to Passive intent in cluster\n" compared to "Moving MRC [%s] -> Passive role cluster"

"The MRC is in the Active role in the cluster" compared to "The MRC is in the Active intent in the cluster"

Thoughts? @jaellio @keithmattix @steeling

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with that; this is the time for naming changes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya that sounds good to me.

Copy link
Contributor Author

@jaellio jaellio Oct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! I will make the update in a follow up PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Role makes sense since we're ultimately representing what the certificate can and cannot do.

@jaellio jaellio removed the wip Work-in-Progress label Oct 17, 2022
@jaellio jaellio marked this pull request as ready for review October 17, 2022 19:44
@jaellio jaellio force-pushed the useIntentForIssuers branch 3 times, most recently from d845887 to 81d872d Compare October 18, 2022 00:08
@jsturtevant
Copy link
Contributor

The cert rotation tests are passing:

• [SLOW TEST:306.823 seconds]
[Tier 2][Bucket 11][] MeshRootCertificate
/home/runner/work/osm/osm/tests/framework/common.go:92
  with Tressor
  /home/runner/work/osm/osm/tests/e2e/e2e_meshrootcertificate_test.go:29
    rotates certificates
• [SLOW TEST:220.601 seconds]
[Tier 2][Bucket 11][] MeshRootCertificate
/home/runner/work/osm/osm/tests/framework/common.go:92
  with CertManager
  /home/runner/work/osm/osm/tests/e2e/e2e_meshrootcertificate_test.go:35
    rotates certificates
• [SLOW TEST:207.898 seconds]
[Tier 2][Bucket 11][] MeshRootCertificate
/home/runner/work/osm/osm/tests/framework/common.go:92
  with Vault
  /home/runner/work/osm/osm/tests/e2e/e2e_meshrootcertificate_test.go:41
    rotates certificates
Ran 3 of 67 Specs in 737.334 seconds
SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 64 Skipped

@@ -490,7 +450,7 @@ func buildMeshRootCertificate(presetMeshRootCertificateConfigMap *corev1.ConfigM
APIVersion: "config.openservicemesh.io/configv1alpha2",
},
ObjectMeta: metav1.ObjectMeta{
Name: meshRootCertificateName,
Name: constants.DefaultMeshRootCertificateName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that the default mesh root certificate always exists. but which essentially prevents users from deleting it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a change in behavior for this PR, this was a local constant (see line 52 above) and I moved it to the constant so it would be in one place to be used across tests as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can add a TODO at least?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steeling Could you explain more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting isn't blocked but upon bootstrapper restart we will see issues. I created #5237 to track. @ksubrmnn hopefully the issue helps explains and proposes a few options for solving.


// PassiveIntent means the settings and certificate provider in this MRC are used for validating
// certificates.
PassiveIntent MeshRootCertificateIntent = "passive"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"passive means its ued for validating", maybe it should be called validateOnly?

Non-blocking comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally like passive but if there is strong agreement here, we should switch active to something like signing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ValidateOnly and Signing make understanding the code a bit more straightforward for newcomers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially leaning this way as well, but in that case does signing mean "signingAndValidating" since its doing both? The ambiguity and verbosity felt to much. For this reason, after sitting on it for awhile and trying out several rotations I began to like passive and active

@jaellio you presented this at the servicemeshcon did you hear any feedback in either direction?


var (
validity = time.Hour
osmConfigClient "github.com/openservicemesh/osm/pkg/gen/client/config/clientset/versioned"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imported/aliased package names should not contain uppercases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Is this in the uber style guide?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a linter that enforced this but I found this which has a little dialog https://stackoverflow.com/a/20864492/697126

)

// fakeMRCClient implements the MRCClient interface
type fakeMRCClient struct{}
type fakeMRCClient struct {
configClient osmConfigClient.Interface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're adding this here, we may as well remove this altogether and use the MRCComposer, which has an interface for the compute client. We can leverage a fake or mock compute client for that.

Copy link
Contributor Author

@jaellio jaellio Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fakeMRCClient in this pkg exists for use in tests to avoid a circular dependency with pkg/certificate/providers. Since the MRCComposer is also defined in pkg/certificate/providers this would result in a circular dependency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's likely indicative of a separate issue then. maybe we should bring MRCComposer into the certificate package, or into it's own subpackage so it can be used in both packages independently

type UseCase string
func (m *Manager) handleMRCEvent(event MRCEvent) error {
log.Debug().Msgf("handling MRC event for MRC %s", event.MRCName)
// TODO(#5226): optimize event handling to reduce cost of listing all MRCs for each event
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we could simply keep the MRC on the issuer struct, and then we have the 2 mrc's we care about. This seems like a simple enough change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I chatted with @jaellio last week, it got a bit complicated on the edge cases I believe. We've got #5226 to track. If we want this in before release we can work can follow up there, but I would be in favor of doing that separately so we can unblock the other PRs that are waiting on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I won't block, but I'll point it that it looks like we could do a drop in replacement into this code:

...
   desiredSigning, desiredValidating := m.getSigningAndValidating(event.MRC)
   // note: no need to filter out inactive anymore, or call ListMeshRootCertificates
...

func getSigningAndValidating(newMRC MRC) {
  mrcs := make(map[string]MRC)
  mrcs[m.signingIssuer.ID] = m.signingIssuer.mrc
  mrcs[m.validatingIssuer.ID] = m.validatingIssuer.mrc
  mrcs[newMRC.Name] = newMRC
  if len(mrcs) > 2 {.../*err*/ }
 
   ... // continue as normal
}

I think using a map like this would also clean up some of the logic in that method. But again, non-blocking

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of a case where this gets tricky is when a controller restarts. In that case the existing manager won't have any sign/validating issuers set. By approaching it this way we are handling the events in a trigger level fashion.

This means that on the first event that comes through the issuer/validator would be the same but this might not be the case, there could be more than one Active/Passive. In this case there are some conditions where the controller could start to re-issue certificates since the ShouldRotate function would see this as a change. Once more events came through it would be normalize but this could be a spot where we have strange behavior intermittently.

Using a custom indexer (#5226) that only contains active MRCs we could be more level triggered and solve this problem of having to filter and list all the MRCS.

Maybe we could combine the two options as well, which would allow us to exit earlier in the case things are set?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds simpler then to call list exactly one time at startup, then use the events for the remainder of the lifetime of the manager?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking at this point

transitionAfter := mrc.Status.TransitionAfter
if transitionAfter == nil {
return false
if mrc1 == nil || mrc2 == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: i don't think this possible, but defense in depth is not necessarily a bad thing.

I'd consider a validateMRC function, which has this check, and others like ValidateIntent. This seems like it would be useful in general, and would allow greater re-use in ie: a validating webhook

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this logic could be combined in a validateMRC method. This would help reduce duplicate code between the validating webhook for the mrc and the mrc reconciler. I think it makes sense to update this in #5220 when it more clear what the commonalities are for mrc validation in the webhook and the reconciler.

type UseCase string
func (m *Manager) handleMRCEvent(event MRCEvent) error {
log.Debug().Msgf("handling MRC event for MRC %s", event.MRCName)
// TODO(#5226): optimize event handling to reduce cost of listing all MRCs for each event
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I won't block, but I'll point it that it looks like we could do a drop in replacement into this code:

...
   desiredSigning, desiredValidating := m.getSigningAndValidating(event.MRC)
   // note: no need to filter out inactive anymore, or call ListMeshRootCertificates
...

func getSigningAndValidating(newMRC MRC) {
  mrcs := make(map[string]MRC)
  mrcs[m.signingIssuer.ID] = m.signingIssuer.mrc
  mrcs[m.validatingIssuer.ID] = m.validatingIssuer.mrc
  mrcs[newMRC.Name] = newMRC
  if len(mrcs) > 2 {.../*err*/ }
 
   ... // continue as normal
}

I think using a map like this would also clean up some of the logic in that method. But again, non-blocking


// if the issuer has already been created for the specified MRC,
// return the existing issuer
if signingIssuer != nil && mrc.Name == signingIssuer.ID {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we have this check here, then we can be less defensive about the shouldUpdateIssuer, and remove the checks on whether the desired is already set or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think shouldUpdateIssuer should keep the check if the issuers are already set to their desired values. If this condition is true, the issuers should not be updated and therefore shouldUpdateIssuer should return false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this handles the scenario desiredSigningMRC != desiredValidatingMRC and both MRCs have active intents, their state is non-deterministic. ?

Even if it does, I kind of like being explicit about though and exiting the reconciliation early. It isn't immediately obvious that there are no changes otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave a quick thought on this extra defense in-depth:

You can find a lot of literature around unit testing methods in go, which typically mention there's no need to test unexported funcs. Expanding on this, we can think of unexported funcs as internals that are subject to change, and wholly within the control of the developers, while exported methods should be tested, because they define the API of our package, and can be used throughout the codebase.

Indeed, if a method becomes used frequently enough it can become difficult to change its behavior, and with many developers working on a project there tends to be specialization among packages, so good API's become more prudent as a dev workign on package A may not be familiar with the internals of package B. This is less true for unexported methods which are contained within the package.

Similarly, an exported method should be more defensive about parameter sanitization, while an internal method less so.

So without the need to add extra defense in-depth, we're left with the question of should we? There's 2 considerations here: readability, and defense from ourselves.

On one hand, the extra defense in-depth may hurt readability. It tells our future readers that a value is possible, and we should protect from it, even when it is currently not possible (again we don't have these guarantees at the API level). It creates an extra case in our state machine that we have to hold in our mental model.

On the other hand, we protect from our future selves by limiting the chance of a future caller not sanitizing inputs correctly (however if we are sanitizing at the API level, then any internal calls should be safe)

Personally, I err on the side that we read code much more than we write it, so we should aim for readability, but I'll defer to others on this.

Sorry for the long text on a mostly benign topic :)

}
if validatingIssuer != nil && mrc.Name == validatingIssuer.ID {
return validatingIssuer, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may also want to check if we are promoting/demoting an MRC, to avoid an extra call to GetCertIssuerForMRC, if we're swapping the issuers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this case is already handled by the existing conditionals. For a given MRC we check if the name matches the ID of either issuer and return the issuer on a match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah gotcha, sounds like a good thing for a test case :)

@jaellio jaellio force-pushed the useIntentForIssuers branch from dc33a90 to 1f029bf Compare November 1, 2022 00:32
jaellio and others added 8 commits November 2, 2022 15:42
Signed-off-by: jaellio <[email protected]>

refactored mrc reconciler methods

Signed-off-by: jaellio <[email protected]>

tests

Signed-off-by: Kalya Subramanian <[email protected]>

Begin work on e2es update logging

Signed-off-by: jaellio <[email protected]>

Response to intial PR feedback
- removed deacative as an intent

Signed-off-by: jaellio <[email protected]>

Finish tests to validate basic rotation

Signed-off-by: James Sturtevant <[email protected]>

Use active/active pattern

Signed-off-by: James Sturtevant <[email protected]>

- handle single MRC case in each method rather than seperately
- shouldSetIssuers check
- non-deterministic active, active intent check to avoid
  repeated updates to issuers

Signed-off-by: jaellio <[email protected]>

- test for filtering inactive MRCs

Signed-off-by: jaellio <[email protected]>

- simplified logic for handling 1 MRC in control plane
- added more unit tests
- moved HandleMRCEvent unit test to mrc_reconciler_test.go

Signed-off-by: jaellio <[email protected]>

- clean up lint errors and commented out code

Signed-off-by: jaellio <[email protected]>

- remove duplicate switch case in shouldSetIssuer and setIssuer

Signed-off-by: jaellio <[email protected]>
have active intents, and the issuers are not set to the desired
values, but match either MRC, then the issuers do not need to be
updated

Signed-off-by: jaellio <[email protected]>
- simplified validaion of MRC intent and their combinations
- simplified switch state to two conditional checks to determine
  validating and signing MRCs
- removed unused MRC constants from constants.go
- removed noop status update in osm bootstrap

Signed-off-by: jaellio <[email protected]>
- remove validateMRCIntent

Signed-off-by: jaellio <[email protected]>
@jaellio jaellio force-pushed the useIntentForIssuers branch from 7f5e44e to 1fae622 Compare November 2, 2022 22:42
@jaellio jaellio force-pushed the useIntentForIssuers branch from dd57e57 to bf5f16d Compare November 2, 2022 23:01
@jsturtevant
Copy link
Contributor

@keithmattix a few changes have been made since your last look, would you take another look?

@keithmattix
Copy link
Contributor

Thanks for the ping! Will take a look!

}
return nil
return mrcList[:n]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this protecting against potentially 3 active MRC's? That seems like a critical issue and we should return an error?

func isTerminal(mrc *v1alpha2.MeshRootCertificate) bool {
// TODO(5046): check if the mrc is in a terminal state.
return true
func (m *Manager) updateIssuers(signingIssuer, validatingIssuer *issuer) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function doesn't need to return an error

Copy link
Contributor

@keithmattix keithmattix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit, but nothing worth blocking this PR on. I'd prefer to bias towards action and get this merged and iterate via separate PRs. Nice work!

@@ -136,6 +107,7 @@ var testPresetMeshRootCertificate = &corev1.ConfigMap{
},
Data: map[string]string{
presetMeshRootCertificateJSONKey: `{
"intent": "Active",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the casing probably matters here; should be active instead of Active

@jaellio jaellio merged commit 376a826 into openservicemesh:main Nov 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
6 participants