Skip to content

🐛 fix: add registry client TLS connection idle timeout #389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

mhrabovcin
Copy link
Contributor

What this PR does / why we need it:
Currently the controller runtime creates a new registry client for interaction with OCI registries when HelmReleaseProxy or HelmChartProxy is reconciled. The client is not being reused and is discarded after the reconciliation loop for a single item exits. When a TLS configuration is provided in k8s resources spec the controller uses helm library NewRegistryClientWithTLS function, that creates a http.Client and http.Transport on for each registry client instance.

This causes a problem with registry servers that won't close the http connection. The http.Transport implementation will keep the connection open and ready for re-use. This can result in many leaked goroutines with TCP connections opened in memory with no chance of reusing those.

In our tests when the controller was handling 100s HelmReleaseProxy and HelmChartProxy objects the memory consumption was high, and the controller hit a limit of maximum number of ephemeral TCP ports for outgoing TCP connections.

$ ss -s
Total: 28236
TCP:   29500 (estab 28233, closed 1264, orphaned 0, timewait 9)

Transport Total     IP        IPv6
RAW	  0         0         0
UDP	  0         0         0
TCP	  28236     28233     3
INET	  28236     28233     3
FRAG	  0         0         0

By adding the IdleConnTimeout the TCP connections are correctly closed by the client and goroutines are cleaned up. Example:

$ ss -s
Total: 7
TCP:   1277 (estab 3, closed 1270, orphaned 0, timewait 8)

Transport Total     IP        IPv6
RAW       0         0         0
UDP       0         0         0
TCP       7         2         5
INET      7         2         5
FRAG      0         0         0

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 17, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @mhrabovcin!

It looks like this is your first PR to kubernetes-sigs/cluster-api-addon-provider-helm 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-addon-provider-helm has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 17, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @mhrabovcin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 17, 2025
@tuxtof
Copy link
Contributor

tuxtof commented Apr 17, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 17, 2025
@mhrabovcin mhrabovcin changed the title fix: add registry client TLS connection idle timeout 🐛 🐛 fix: add registry client TLS connection idle timeout Apr 17, 2025
Copy link
Contributor

@Jont828 Jont828 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes seem reasonable! Do you think it would be feasible to add some unit tests given that we have quite a few more functions and if/else branches?

@mhrabovcin mhrabovcin force-pushed the registry-client-idle-timeout branch from 18fc46c to 83da56f Compare April 19, 2025 13:42
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 19, 2025
@mhrabovcin mhrabovcin force-pushed the registry-client-idle-timeout branch from 34852b7 to 85b8c39 Compare April 19, 2025 13:44
@mhrabovcin
Copy link
Contributor Author

Thanks for the review @Jont828 . I've added unit tests for the refactored function:

sigs.k8s.io/cluster-api-addon-provider-helm/internal/helm_client.go:261:                newDefaultRegistryClient                88.9%
sigs.k8s.io/cluster-api-addon-provider-helm/internal/helm_client.go:293:                newClientTLS                            100.0%
sigs.k8s.io/cluster-api-addon-provider-helm/internal/helm_client.go:309:                certPoolFromFile                        100.0%

@mhrabovcin mhrabovcin requested a review from Jont828 April 23, 2025 09:58
Copy link
Contributor

@Jont828 Jont828 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor comments. After that, please squash commits and we will be good to go!

return registry.NewClient(opts...)
}

func newClientTLS(caFile string, insecureSkipTLSverify bool) (*tls.Config, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a doc comment for this function?

return &config, nil
}

func certPoolFromFile(filename string) (*x509.CertPool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a doc comment for this as well?

func certPoolFromFile(filename string) (*x509.CertPool, error) {
b, err := os.ReadFile(filename)
if err != nil {
return nil, errors.Errorf("can't read CA file: %v", filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, errors.Errorf("can't read CA file: %v", filename)
return nil, errors.Wrapf(err, "...")

Just so we don't lose the original error message

@mhrabovcin mhrabovcin force-pushed the registry-client-idle-timeout branch from e7a2929 to 4efe7c4 Compare April 29, 2025 09:07
@mhrabovcin mhrabovcin force-pushed the registry-client-idle-timeout branch from 4efe7c4 to 6432b3f Compare April 29, 2025 11:40
@Jont828
Copy link
Contributor

Jont828 commented Apr 29, 2025

Thanks for working on this!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 29, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jont828, mhrabovcin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2025
@k8s-ci-robot k8s-ci-robot merged commit 7fa1379 into kubernetes-sigs:main Apr 29, 2025
11 checks passed
@mhrabovcin mhrabovcin deleted the registry-client-idle-timeout branch April 30, 2025 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants