This repository was archived by the owner on Jul 11, 2023. It is now read-only.

Certificate rotation broken in long running environments #5000

Closed
@shashankram

Bug description:
In two of my setups that have been running for several days, certificate rotation is broken. Expired certificates in Envoy are never rotated, which completely disrupts traffic between apps.

Log on the client:

{"bytes_received":0,"response_flags":"UF","upstream_service_time":null,"response_code":503,"start_time":"2022-08-15T18:02:56.864Z","authority":"fortio.demo.svc.cluster.local:8080","duration":1,"bytes_sent":195,"protocol":"HTTP/1.1","x_forwarded_for":null,"path":"/","request_id":"26aa1f49-a09b-4b13-81bd-0acfa49852b8","user_agent":"fortio.org/fortio-1.34.1","response_code_details":"upstream_reset_before_response_started{connection_failure,TLS_error:_268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED}","time_to_first_byte":null,"requested_server_name":null,"upstream_host":"10.244.1.5:8080","method":"GET","upstream_cluster":"demo/fortio|8080"}

Stat on the client:
cluster.demo/fortio|8080.ssl.fail_verify_error: 50
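To spot this symptom across many access log lines, the log entry above can be parsed and filtered on its `response_flags` and `response_code_details` fields. The following is a minimal sketch, not part of OSM: the type `accessLogEntry` and the function `isCertVerifyFailure` are hypothetical names, and the JSON field names are taken from the log line in this report.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// accessLogEntry holds only the fields of the Envoy JSON access log that
// are inspected here; the field names match the log line in this report.
// (Hypothetical helper type, not an OSM or Envoy API.)
type accessLogEntry struct {
	ResponseCode        int    `json:"response_code"`
	ResponseFlags       string `json:"response_flags"`
	ResponseCodeDetails string `json:"response_code_details"`
	UpstreamCluster     string `json:"upstream_cluster"`
}

// isCertVerifyFailure reports whether a log line describes an upstream
// connection failure (flag "UF") whose details mention
// CERTIFICATE_VERIFY_FAILED.
func isCertVerifyFailure(line string) (bool, error) {
	var e accessLogEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return false, err
	}
	failed := e.ResponseFlags == "UF" &&
		strings.Contains(e.ResponseCodeDetails, "CERTIFICATE_VERIFY_FAILED")
	return failed, nil
}

func main() {
	// Shortened copy of the client log line from this report.
	line := `{"response_flags":"UF","response_code":503,` +
		`"response_code_details":"upstream_reset_before_response_started{connection_failure,TLS_error:_268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED}",` +
		`"upstream_cluster":"demo/fortio|8080"}`

	failed, err := isCertVerifyFailure(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("certificate verification failure:", failed)
}
```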

The certs stat osm_bug_report_2928885602/namespaces/demo/pods/fortio-client-b9b7bbfb8-hc9wr/commands/osm_proxy_get_certs_fortio-client-b9b7bbfb8-hc9wr_-n_demo confirms that the cert has not been updated: the client cert expires at 2022-08-12T19:05:52Z, but the expiration date should be later than 2022-08-15 (the current date).
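The expiry comparison can be checked with a few lines of Go. This is a sketch under stated assumptions: the function name `overdue` is hypothetical, and "now" is taken from the `start_time` of the client log line in this report, truncated to whole seconds.

```go
package main

import (
	"fmt"
	"time"
)

// overdue parses two RFC 3339 timestamps and returns how long the
// certificate has been expired at the given time. A zero or negative
// result means the certificate is still valid.
// (Hypothetical helper for illustration only.)
func overdue(notAfter, now string) (time.Duration, error) {
	expiry, err := time.Parse(time.RFC3339, notAfter)
	if err != nil {
		return 0, err
	}
	current, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return 0, err
	}
	// Parsed times carry no monotonic reading, so Sub compares wall-clock time.
	return current.Sub(expiry), nil
}

func main() {
	// Expiry from the certs dump; "now" assumed from the access log start_time.
	d, err := overdue("2022-08-12T19:05:52Z", "2022-08-15T18:02:56Z")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	if d > 0 {
		fmt.Println("certificate expired", d, "ago")
	} else {
		fmt.Println("certificate valid for another", -d)
	}
}
```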

According to the XDS cluster stats collected in the bug report, both the client and the server are connected to the controller.

osm-controller indicates that the cert default.demo.cluster.local has expired but has not been rotated:

	 Common Name: "default.demo.cluster.local"
	 Valid Until: 2022-08-12 19:05:52.0391624 +0000 UTC m=+86504.429431801 (16h45m27.4651171s remaining)
	 Issuing CA (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
	 Trusted CAs (SHA256): d4bc5ec6f2ab02a7f484f5c36ee90222435250592d0942c686737ba0a77e857e
	 Cert Chain (SHA256): c4b68f17387579bff64ee155ae7263687c2d3fbea929ad56b0489b6ebd074dd8
	 x509.SignatureAlgorithm: SHA256-RSA
	 x509.PublicKeyAlgorithm: RSA
	 x509.Version: 3
	 x509.SerialNumber: 980b07dedb46443624884fcfcd9b9f03
	 x509.Issuer: CN=osm-ca.openservicemesh.io,O=Open Service Mesh,L=CA,C=US
	 x509.Subject: CN=default.demo.cluster.local,O=Open Service Mesh
	 x509.NotBefore (begin): 2022-08-11 19:05:52 +0000 UTC (95h30m32.0984793s ago)
	 x509.NotAfter (end): 2022-08-12 19:05:52 +0000 UTC (-71h30m32.0984815s remaining)
	 x509.BasicConstraintsValid: true
	 x509.IsCA: false
	 x509.DNSNames: [default.demo.cluster.local]
	 Cert struct expiration vs. x509.NotAfter: -39.1624ms
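The dump above reports two different "remaining" values for the same certificate: the `Valid Until` line, whose timestamp carries a Go monotonic clock reading (`m=+86504.429431801`), shows time remaining, while the `x509.NotAfter` line shows the certificate as expired. The report does not state a root cause. One possible explanation, offered here only as an assumption, is that the remaining time was computed from monotonic readings: Go's `time` package uses the monotonic reading in `Sub` when both operands have one, and its documentation notes that on some systems the monotonic clock stops while the machine is asleep. The sketch below shows how a monotonic reading can be detected and stripped with `Round(0)`; the function names `hasMonotonic`, `wallClockRemaining`, and `sample` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// hasMonotonic reports whether t carries a monotonic clock reading.
// time.Time.String appends an "m=±<value>" field only in that case.
func hasMonotonic(t time.Time) bool {
	return strings.Contains(t.String(), " m=")
}

// wallClockRemaining returns the time left until expiry using wall-clock
// time only. Round(0) strips any monotonic reading from both operands.
func wallClockRemaining(expiry, now time.Time) time.Duration {
	return expiry.Round(0).Sub(now.Round(0))
}

// sample returns the current time and an expiry 24 hours later. Because
// the expiry is derived from time.Now, it keeps the monotonic reading.
func sample() (now, expiry time.Time) {
	now = time.Now()
	return now, now.Add(24 * time.Hour)
}

func main() {
	now, expiry := sample()
	fmt.Println("expiry has monotonic reading:", hasMonotonic(expiry))
	fmt.Println("after Round(0):", hasMonotonic(expiry.Round(0)))
	fmt.Println("wall-clock remaining:", wallClockRemaining(expiry, now))
}
```

A timestamp parsed from a certificate, such as `x509.NotAfter`, has no monotonic reading, so comparisons against it always use wall-clock time.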

Affected area (please mark with X where applicable):

  • Certificate Management [X]

Expected behavior:
Certificates should be rotated in long running environments.

Steps to reproduce the bug (as precisely as possible):
I observed this bug twice while running the global rate limit demo over multiple days.

How was OSM installed?:

osm install --set osm.image.registry=$CTR_REGISTRY --set osm.image.tag=$CTR_TAG --set osm.image.pullPolicy=Always --set osm.enablePermissiveTrafficPolicy=true

Bug report archive:

2313848016_osm-bug-report.tar.gz

Environment:

  • OSM version (use osm version): latest-main
  • Kubernetes version (use kubectl version): v1.23.4
