Globalnet gateway monitor exits if controller startup fails after leadership acquired #3337

@tpantelis

Description

In a production environment with several gateway nodes, it was observed that each globalnet pod had restarted numerous times over several weeks. Each time, the pod had just acquired leadership and was starting the controllers when a transient API server error occurred, e.g.:

Error starting the controllers : error="error creating the ClusterGlobalEgressIP controller: error retrieving ClusterGlobalEgressIP resource \"cluster-egress.submariner.io\": the server was unable to return a response in the time allotted, but may still be processing the request"

This causes the process to exit with a fatal error. Since we keep the process running when leadership is lost, we should do the same when an error occurs while starting the controllers. In most of these cases, leader election had also timed out and relinquished leadership, so exiting is unnecessary. If leadership is still held, the process should either explicitly relinquish it or retry starting the controllers.

Metadata

Assignees

Type

No type

Projects

Status

Todo

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests