e2e: delete worker machines and ensure recovery #22090
Conversation
I want to ensure we get a graceful termination signal for an existing pod that was scheduled to the worker, probably by verifying a terminationLogMessage or some equivalent. /hold
Force-pushed from 57863ed to 52a75a0.
Actually, we can handle the termination log message in a follow-on. /hold cancel
@enxebre is there a PR I missed that was handling the usage of the sigs.k8s.io labels?
test/extended/machines/workers.go (Outdated)

	)

	const (
		// TODO: cloud team may change this when changing label keys
openshift/cluster-api-provider-aws#161 got merged 2 days ago.
lgtm. cc @spangenberg for coordinating with labels: openshift/cluster-api-provider-aws#161 (comment)
/test e2e-aws
failed to applied terraform:
/retest
Since openshift/cluster-api-provider-aws#161 got merged, we should change the label now. @derekwaynecarr, could you apply the change?
test/extended/machines/workers.go (Outdated)

	const (
		// TODO: cloud team may change this when changing label keys
		machineLabelSelectorWorker = "sigs.k8s.io/cluster-api-machine-role=worker"
Suggested change:
	-	machineLabelSelectorWorker = "sigs.k8s.io/cluster-api-machine-role=worker"
	+	machineLabelSelectorWorker = "machine.openshift.io/cluster-api-machine-role=worker"
@spangenberg This will never pass the test with that label unless openshift/installer#1263 gets merged, which is waiting to be rebased
I can hold this PR until the requisite PRs land.
cluster storage operator failed to report. level=info msg="Waiting up to 30m0s for the cluster to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator cluster-storage-operator has not yet reported success" /test e2e-aws
@derekwaynecarr
As discussed in chat, the fact that we wait 5m to detect a deleted instance from the cloud provider is a bug. It was probably done to rate-limit queries to AWS, but we should try to reduce the detection time for that specific condition to ease this pain point.
@smarterclayton @derekwaynecarr since we drain nodes on deletion, they will go unschedulable right away, so the scheduler does not take them into consideration while they wait to be garbage collected. Also, I'm thinking of getting the
Force-pushed from 52a75a0 to 713c47b.
@enxebre since you know the machine was actually deleted, I do not see a major risk in doing it there, but I do think we need to look at the node lifecycle controller, as it should look at all nodes every 5s per its monitor period. Either way, this PR is ready to go.
/retest
1 similar comment
/retest
		return result
	}

	// nodeNames returns the names of nodes
nit: should be machineNames.
	)

	const (
		machineLabelSelectorWorker = "machine.openshift.io/cluster-api-machine-role=worker"
Not a blocker: we are trying to reduce dependencies on OpenShift-specific annotations, so we could fetch worker machines by matching the associated node label.
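The selector-based fetch discussed here can be illustrated with a self-contained sketch. The `machine` struct and `filterByLabel` helper below are hypothetical stand-ins for the real cluster-api Machine type and a server-side label-selector list call; only the label key/value come from the thread:

```go
package main

import "fmt"

// machine is a minimal, illustrative stand-in for the Machine object; the
// real test would list typed objects through a client with a label selector.
type machine struct {
	name   string
	labels map[string]string
}

// filterByLabel returns the machines whose labels contain key=value,
// mirroring what an equality-based label selector does server-side.
func filterByLabel(machines []machine, key, value string) []machine {
	var out []machine
	for _, m := range machines {
		if m.labels[key] == value {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	machines := []machine{
		{"worker-a", map[string]string{"machine.openshift.io/cluster-api-machine-role": "worker"}},
		{"master-a", map[string]string{"machine.openshift.io/cluster-api-machine-role": "master"}},
	}
	workers := filterByLabel(machines, "machine.openshift.io/cluster-api-machine-role", "worker")
	for _, m := range workers {
		fmt.Println(m.name)
	}
}
```

The same filtering could instead target a node label shared with upstream, which is the reduced-dependency approach the comment suggests.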
To clarify: now the machine controller gets rid of the node object after removing the instance. Also the
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: derekwaynecarr, enxebre. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This is updated now. Daniel can't re-approve as he's off.
/retest Please review the full test history for this PR and help us cut down flakes.
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes.
Add disruptive testing to ensure that if we delete worker machines, we recover with new machines and ready nodes within 5 minutes.
/cc @enxebre