Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Allow more permissive extensibility for securityRules and additional CP LoadBalancers #5525

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Danil-Grigorev
Copy link
Member

@Danil-Grigorev Danil-Grigorev commented Mar 28, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implement AdditionalControlPlaneLBPorts field for the additional inbound control-plane load balancer port configuration, allowing external access for user specified ports on the CP machine.

additionalAPIServerLBPorts:
- name: RKE2
   port: 9345

This change also improves UX while using CAPZ with RKE2 provider, by allowing to specify only additional security rules on the AzureCluster resource, and making existing security rule defaults for APIServer and SSH port more permissive and based on destination port selection.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5511

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

- Added `AdditionalAPIServerLBPorts` field for the additional inbound control-plane load balancer port configuration, allowing to access user specified ports on the CP machines.
- Improved `securityRules` selection for the `RKE2` provider. Defaulting now checks for ports `6443` and `22` in the existing list and appends them if missing, allowing users to specify only additional rules instead of the full set, when an additional rule is needed.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 28, 2025
@k8s-ci-robot k8s-ci-robot requested review from Jont828 and nojnhuh March 28, 2025 14:27
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 28, 2025
Copy link

codecov bot commented Mar 28, 2025

Codecov Report

Attention: Patch coverage is 90.16393% with 6 lines in your changes missing coverage. Please review.

Project coverage is 52.91%. Comparing base (1c6a5d3) to head (51e7c19).
Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
api/v1beta1/types.go 0.00% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5525      +/-   ##
==========================================
+ Coverage   52.61%   52.91%   +0.29%     
==========================================
  Files         272      272              
  Lines       29485    29515      +30     
==========================================
+ Hits        15513    15617     +104     
+ Misses      13165    13086      -79     
- Partials      807      812       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 28, 2025
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from bec7326 to fbe11d7 Compare March 28, 2025 14:37
@@ -893,6 +893,20 @@ func (s SubnetSpec) IsIPv6Enabled() bool {
return false
}

// GetSecurityRuleByDestination returns security group rule, which matches provided destination ports.
func (s SubnetSpec) GetSecurityRuleByDestination(ports string) *SecurityRule {
if s.SecurityGroup.SecurityRules == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapping the below for loop in if s.SecurityGroup.SecurityRules != nil { is equivalent and allows us to remove the add'l return nil response (we don't need a specific nil return for this condition because the default return before exiting the function is nil)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed, thanks.

@@ -1020,9 +1020,18 @@ func (s *ClusterScope) SetControlPlaneSecurityRules() {
if !s.ControlPlaneEnabled() {
return
}
if s.ControlPlaneSubnet().SecurityGroup.SecurityRules == nil {

subnet := s.ControlPlaneSubnet()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this flow where are we adding any user-provided security rules? (for example if a user specifies TCP 9345)

or is that elsewhere and the purpose of this change is to filter out 22 and apiserver port if it's not included?

Copy link
Member Author

@Danil-Grigorev Danil-Grigorev Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flow performs only defaulting on top of the user-provided set of rules, which may not be empty. This way having a field populated does not always need to specify all allowed ports, only additional ones or overrides.

I’m currently validating how this works in tandem with ClusterClass definitions we have. The desired goal is to permit specifying registrationMethod: control-plane-endpoint and allow RKE2 to register on the allowed port. But ideally this should follow internal name resolution, so that the traffic wouldn’t go through external load balancer.

If it doesn’t work, maybe it would still require additional LB rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable to have the additional ports of the APIServer LB opened-up/have an opinion about.

Could you please elaborate on why the security rules should not be updated too ?

Copy link
Member Author

@Danil-Grigorev Danil-Grigorev Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, maybe I’m not fully understanding. I will try to give my thoughts on this:

I was attempting to describe the flow of how the rules are processed. We only apply defaults on top of what the user provides, so the user can still specify ports like TCP 9345 explicitly in security rules. We’re not overriding or stripping anything out, just adding fallback values when needed. LB rules were needed as well, since CP endpoint is external from the cluster perspective.

Current set of changes is what was required to make cluster healthy, and added e2e test displays that. My previous comment was a set of assumptions, or “paths to explore later” if it doesn’t work.

@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch 3 times, most recently from 452799e to df2c355 Compare April 1, 2025 11:16
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from df2c355 to cf4c84b Compare April 1, 2025 15:41
@Danil-Grigorev
Copy link
Member Author

/test ?

@k8s-ci-robot
Copy link
Contributor

@Danil-Grigorev: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-build
/test pull-cluster-api-provider-azure-ci-entrypoint
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-test
/test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-azure-apidiff
/test pull-cluster-api-provider-azure-apiserver-ilb
/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-conformance
/test pull-cluster-api-provider-azure-conformance-custom-builds
/test pull-cluster-api-provider-azure-conformance-dual-stack-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-ipv6-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-workload-upgrade
/test pull-cluster-api-provider-azure-load-test-custom-builds
/test pull-cluster-api-provider-azure-windows-custom-builds
/test pull-cluster-api-provider-azure-windows-with-ci-artifacts

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-azure-apidiff
pull-cluster-api-provider-azure-apiversion-upgrade
pull-cluster-api-provider-azure-build
pull-cluster-api-provider-azure-ci-entrypoint
pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-e2e-aks
pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 1, 2025
@Danil-Grigorev Danil-Grigorev changed the title ✨ Allow more permissive extensibility for securityRules ✨ Allow more permissive extensibility for securityRules and additional CP LoadBalancers Apr 1, 2025
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch 4 times, most recently from d744062 to 0d5e30f Compare April 2, 2025 11:55
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from 9e8e8e1 to 3592388 Compare April 2, 2025 16:49
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from nawazkh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch 5 times, most recently from 14e9903 to 85603df Compare April 2, 2025 20:33
@Danil-Grigorev
Copy link
Member Author

/test pull-cluster-api-provider-azure-e2e-optional

Copy link
Member

@nawazkh nawazkh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Danil-Grigorev 👋🏼 ! Thank you for creating this PR!
Below is an initial round of review. Will iterate again.

@@ -893,6 +906,17 @@ func (s SubnetSpec) IsIPv6Enabled() bool {
return false
}

// GetSecurityRuleByDestination returns security group rule, which matches provided destination ports.
func (s SubnetSpec) GetSecurityRuleByDestination(ports string) *SecurityRule {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port seems to be holding a single port here.

Suggested change
func (s SubnetSpec) GetSecurityRuleByDestination(ports string) *SecurityRule {
func (s SubnetSpec) GetSecurityRuleByDestination(port string) *SecurityRule {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was originally like so, but users can supply a list o ports with this, like 22,443

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the current implementation does not seem to be iterating over a comma-separated list of ports ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doublechecked it, maybe my initial assumption was wrong. It can be a number or an * there, so changed it to port as suggested.

@@ -893,6 +906,17 @@ func (s SubnetSpec) IsIPv6Enabled() bool {
return false
}

// GetSecurityRuleByDestination returns security group rule, which matches provided destination ports.
func (s SubnetSpec) GetSecurityRuleByDestination(ports string) *SecurityRule {
for _, rule := range s.SecurityGroup.SecurityRules {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably for the long term, we should change s.SecurityGroup.SecurityRules to a dictionary type for faster processing.

@@ -141,22 +146,22 @@ spec:
name: my-subnet-cp-nsg
securityRules:
- name: "allow_ssh"
description: "allow SSH"
description: "Deny SSH"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we updating this example to a Deny SSH ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to showcase that even though port 22 is defaulted, the rule for this can be overridden to deny it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would benefit if we show an overriding scenario the example you added that adds 9345 to the APIServer LB.

Updating the rule to Deny could throw off users in the first glance.

Comment on lines 9 to 12
computeGallery:
gallery: replace-me
name: replace-me
version: replace-me
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this left intentionally with replace-me value ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was replaced by CC patch, but it turns out that having this field set by default causes errors on double template apply, which might have affected e2e execution. Replaced with full template patch now.

@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch 2 times, most recently from 07e0e7b to f5702e8 Compare April 3, 2025 11:56
@Danil-Grigorev
Copy link
Member Author

Danil-Grigorev commented Apr 3, 2025

The tests are still failing due to this error:

  I0403 10:40:04.800626   45679 warning_handler.go:65] "unknown field \"spec.template.spec.networkSpec.additionalAPIServerLBPorts\"" logger="KubeAPIWarningLogger"

even though the template in artifacts has the value and CRD changes are a part of this PR. Is there something missing, please advise? (Edit: my mistake - wrong variable name)

@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch 3 times, most recently from 6fdfb5f to b66e8fe Compare April 3, 2025 18:23
@Danil-Grigorev
Copy link
Member Author

/test pull-cluster-api-provider-azure-e2e-optional

@Danil-Grigorev
Copy link
Member Author

/retest

1 similar comment
@Danil-Grigorev
Copy link
Member Author

/retest

Copy link
Member

@nawazkh nawazkh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great PR, a lot of work has been put in here. Thank you @Danil-Grigorev !
I am glad that the optional RKE2 test is working!

If I could request, could you please add unit test cases to cluster_test.go and loadbalancers/spec_test.go to support your changes.
I also added some comments below, and would love to hear your thoughts!

@@ -111,9 +111,22 @@ type NetworkSpec struct {
// +optional
ControlPlaneOutboundLB *LoadBalancerSpec `json:"controlPlaneOutboundLB,omitempty"`

// AdditionalControlPlaneLBPorts is the configuration for the additional inbound control-plane load balancer ports
// +optional
AdditionalControlPlaneLBPorts []LoadBalancerPort `json:"additionalControlPlaneLBPorts,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of renaming AdditionalControlPlaneLBPorts to AdditionalAPIServerLBPorts ?

CAPZ has APIServerLB, NodeOutboundLB and ControlplaneOutboundLB. Since the additional ports will be used in APIServerLB, it is better, I believe, to have the additional ports renamed to AdditionalAPIServerLBPorts

@@ -893,6 +906,17 @@ func (s SubnetSpec) IsIPv6Enabled() bool {
return false
}

// GetSecurityRuleByDestination returns security group rule, which matches provided destination ports.
func (s SubnetSpec) GetSecurityRuleByDestination(ports string) *SecurityRule {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the current implementation does not seem to be iterating over a comma-separated list of ports ?

Comment on lines 78 to 81

// AdditionalControlPlaneLBPorts is the configuration for the additional inbound control-plane load balancer ports
// +optional
AdditionalControlPlaneLBPorts []LoadBalancerPort `json:"additionalControlPlaneLBPorts,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@@ -1020,9 +1020,18 @@ func (s *ClusterScope) SetControlPlaneSecurityRules() {
if !s.ControlPlaneEnabled() {
return
}
if s.ControlPlaneSubnet().SecurityGroup.SecurityRules == nil {

subnet := s.ControlPlaneSubnet()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable to have the additional ports of the APIServer LB opened-up/have an opinion about.

Could you please elaborate on why the security rules should not be updated too ?

Comment on lines 1033 to 1035
if subnet.SecurityGroup.SecurityRules == nil {
s.AzureCluster.Spec.NetworkSpec.UpdateControlPlaneSubnet(subnet)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add more context on setting s.AzureCluster.Spec.NetworkSpec.UpdateControlPlaneSubnet(subnet) three times in this function ?
Could we not update it once in the end of the function ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but by equivalent aggregation of ifs evaluating to true. Since the method is only setting struct default values, I assumed the cost is negligible, for the improved readability and confidence of data being persisted.

}
count := 0
for _, machine := range machineList.Items {
if machine.Status.NodeRef != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have a more granular check here instead of checking for not nil ?

Copy link
Member Author

@Danil-Grigorev Danil-Grigorev Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to Ready condition, which:

is true if the Machine's deletionTimestamp is not set, Machine's BootstrapConfigReady, InfrastructureReady,
	// NodeHealthy and HealthCheckSucceeded

But the main goal of the check is to verify that CP nodes joined cluster in the expected quantity. Previous nil check fulfilled that too…

@@ -141,22 +146,22 @@ spec:
name: my-subnet-cp-nsg
securityRules:
- name: "allow_ssh"
description: "allow SSH"
description: "Deny SSH"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would benefit if we show an overriding scenario the example you added that adds 9345 to the APIServer LB.

Updating the rule to Deny could throw off users in the first glance.

@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from 0091001 to 34ffcbb Compare April 10, 2025 16:21
@Danil-Grigorev Danil-Grigorev force-pushed the extend-security-rules branch from 34ffcbb to b408363 Compare April 10, 2025 16:29
@Danil-Grigorev
Copy link
Member Author

/retest

@nawazkh
Copy link
Member

nawazkh commented Apr 10, 2025

/test pull-cluster-api-provider-azure-e2e

Signed-off-by: Danil-Grigorev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

Implement arbitrary additional ingress rules to support CAPRKE2
5 participants