Certificate rotation broken in long-running environments #5000
Bug description:
In two of my setups that had been running for several days, certificate rotation is broken. Expired certificates in Envoy are never rotated, which completely disrupts traffic between apps.
Log on the client:
{"bytes_received":0,"response_flags":"UF","upstream_service_time":null,"response_code":503,"start_time":"2022-08-15T18:02:56.864Z","authority":"fortio.demo.svc.cluster.local:8080","duration":1,"bytes_sent":195,"protocol":"HTTP/1.1","x_forwarded_for":null,"path":"/","request_id":"26aa1f49-a09b-4b13-81bd-0acfa49852b8","user_agent":"fortio.org/fortio-1.34.1","response_code_details":"upstream_reset_before_response_started{connection_failure,TLS_error:_268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED}","time_to_first_byte":null,"requested_server_name":null,"upstream_host":"10.244.1.5:8080","method":"GET","upstream_cluster":"demo/fortio|8080"}
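When triaging similar access logs, a small filter can pick out the entries that failed for this reason. The following is a minimal sketch, not part of OSM: the type `accessLogEntry` and the function `isCertVerifyFailure` are hypothetical names, and the JSON keys are taken from the log line above. It treats an entry as a certificate verification failure when the response flag is `UF` (upstream connection failure) and the details mention `CERTIFICATE_VERIFY_FAILED`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// accessLogEntry holds only the fields of the Envoy JSON access log that
// this check needs. Hypothetical type, for illustration.
type accessLogEntry struct {
	ResponseCode        int    `json:"response_code"`
	ResponseFlags       string `json:"response_flags"`
	ResponseCodeDetails string `json:"response_code_details"`
	UpstreamCluster     string `json:"upstream_cluster"`
}

// isCertVerifyFailure reports whether a single JSON access log line describes
// an upstream connection failure caused by TLS certificate verification.
func isCertVerifyFailure(line string) (bool, error) {
	var e accessLogEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return false, err
	}
	failed := e.ResponseFlags == "UF" &&
		strings.Contains(e.ResponseCodeDetails, "CERTIFICATE_VERIFY_FAILED")
	return failed, nil
}

func main() {
	// Shortened copy of the client log entry from this report.
	line := `{"response_flags":"UF","response_code":503,` +
		`"response_code_details":"upstream_reset_before_response_started{connection_failure,TLS_error:_268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED}",` +
		`"upstream_cluster":"demo/fortio|8080"}`

	failed, err := isCertVerifyFailure(line)
	if err != nil {
		panic(err)
	}
	fmt.Println("certificate verification failure:", failed)
}
```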
Stat on the client:
cluster.demo/fortio|8080.ssl.fail_verify_error: 50
The certs stat in osm_bug_report_2928885602/namespaces/demo/pods/fortio-client-b9b7bbfb8-hc9wr/commands/osm_proxy_get_certs_fortio-client-b9b7bbfb8-hc9wr_-n_demo confirms that the cert was not updated: the client cert has an expiration date of 2022-08-12T19:05:52Z. The expiration date should be later than 2022-08-15 (the current date).
Both the client and server are connected to the controller as per the XDS cluster stats collected in the bug-report.
osm-controller indicates the cert default.demo.svc.cluster.local has expired but has not been rotated:
Common Name: "default.demo.cluster.local"
Valid Until: 2022-08-12 19:05:52.0391624 +0000 UTC m=+86504.429431801 (16h45m27.4651171s remaining)
Issuing CA (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
Trusted CAs (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
Cert Chain (SHA256): c4b68f17387579bff64ee155ae7263687c2d3fbea929ad56b0489b6ebd074dd8
x509.SignatureAlgorithm: SHA256-RSA
x509.PublicKeyAlgorithm: RSA
x509.Version: 3
x509.SerialNumber: 980b07dedb46443624884fcfcd9b9f03
x509.Issuer: CN=osm-ca.openservicemesh.io,O=Open Service Mesh,L=CA,C=US
x509.Subject: CN=default.demo.cluster.local,O=Open Service Mesh
x509.NotBefore (begin): 2022-08-11 19:05:52 +0000 UTC (95h30m32.0984793s ago)
x509.NotAfter (end): 2022-08-12 19:05:52 +0000 UTC (-71h30m32.0984815s remaining)
x509.BasicConstraintsValid: true
x509.IsCA: false
x509.DNSNames: [default.demo.cluster.local]
Cert struct expiration vs. x509.NotAfter: -39.1624ms
x509.NotBefore (begin): 2022-08-11 19:05:52 +0000 UTC (95h30m32.0984793s ago)
x509.NotAfter (end): 2022-08-12 19:05:52 +0000 UTC (-71h30m32.0984815s remaining)
Affected area (please mark with X where applicable):
- Certificate Management [X]
Expected behavior:
Certificates should be rotated in long-running environments.
Steps to reproduce the bug (as precisely as possible):
I observed this bug twice while running the global rate limit demo over multiple days.
How was OSM installed?:
osm install --set osm.image.registry=$CTR_REGISTRY --set osm.image.tag=$CTR_TAG --set osm.image.pullPolicy=Always --set osm.enablePermissiveTrafficPolicy=true
Bug report archive:
2313848016_osm-bug-report.tar.gz
Environment:
- OSM version (use osm version): latest-main
- Kubernetes version (use kubectl version): v1.23.4