
e2e: delete worker machines and ensure recovery #22090


Merged

Conversation

derekwaynecarr
Member

Add disruptive testing to ensure that if we delete worker machines, we recover with new machines and ready nodes within 5 minutes.

/cc @enxebre

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 20, 2019
@derekwaynecarr
Member Author

I want to ensure we get a graceful termination signal for an existing pod that was scheduled to a worker.

Probably by verifying a terminationLogMessage or some equivalent.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 20, 2019
@derekwaynecarr
Member Author

Actually, we can handle the termination log message in a follow-on.
I'd prefer to get more testing in at this point.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 20, 2019
@derekwaynecarr
Member Author

@enxebre is there a PR I missed that handles the usage of the sigs.k8s.io labels?

@derekwaynecarr derekwaynecarr changed the title Disruptive worker tests delete machines and ensure recovery e2e: Disruptive worker tests delete machines and ensure recovery Feb 20, 2019
@derekwaynecarr derekwaynecarr changed the title e2e: Disruptive worker tests delete machines and ensure recovery e2e: delete worker machines and ensure recovery Feb 20, 2019

@enxebre
Member

enxebre commented Feb 20, 2019

lgtm cc @spangenberg for coordinating with labels openshift/cluster-api-provider-aws#161 (comment)
and @ingvagabund for draining

@derekwaynecarr
Member Author

/test e2e-aws
/test e2e-aws-serial

@enxebre
Member

enxebre commented Feb 25, 2019

failed to apply terraform:

level=error msg="Error: Error applying plan:"
level=error
level=error msg="1 error occurred:"
level=error msg="\t* aws_route53_zone.int: 1 error occurred:"
level=error msg="\t* aws_route53_zone.int: error waiting for Route53 Hosted Zone (Z3ALYAV3UFHIQD) creation: timeout while waiting for state to become 'INSYNC' (last state: 'PENDING', timeout: 15m0s)"

/retest


@spangenberg spangenberg left a comment


Since openshift/cluster-api-provider-aws#161 got merged, we should change the label now. @derekwaynecarr could you apply the change?


const (
// TODO: cloud team may change this when changing label keys
machineLabelSelectorWorker = "sigs.k8s.io/cluster-api-machine-role=worker"


Suggested change
- machineLabelSelectorWorker = "sigs.k8s.io/cluster-api-machine-role=worker"
+ machineLabelSelectorWorker = "machine.openshift.io/cluster-api-machine-role=worker"
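For illustration, matching a machine's labels against a selector like the one above can be sketched in plain Go (`matchesSelector` is a simplified, hypothetical helper handling only a single `key=value` term; the real code uses Kubernetes label selectors from apimachinery):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesSelector reports whether the given labels satisfy a single
// "key=value" selector such as
// "machine.openshift.io/cluster-api-machine-role=worker".
func matchesSelector(selector string, labels map[string]string) bool {
	parts := strings.SplitN(selector, "=", 2)
	if len(parts) != 2 {
		return false // malformed selector: no "=" present
	}
	return labels[parts[0]] == parts[1]
}

func main() {
	sel := "machine.openshift.io/cluster-api-machine-role=worker"
	worker := map[string]string{"machine.openshift.io/cluster-api-machine-role": "worker"}
	master := map[string]string{"machine.openshift.io/cluster-api-machine-role": "master"}
	fmt.Println(matchesSelector(sel, worker), matchesSelector(sel, master)) // true false
}
```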

Member

@spangenberg This will never pass the test with that label unless openshift/installer#1263 gets merged, which is waiting to be rebased

@derekwaynecarr
Member Author

I can hold this PR until the requisite PRs land.

@derekwaynecarr
Member Author

Cluster storage operator failed to report:

level=info msg="Waiting up to 30m0s for the cluster to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator cluster-storage-operator has not yet reported success"

/test e2e-aws

@enxebre
Member

enxebre commented Feb 28, 2019

@derekwaynecarr
openshift/installer#1263 got merged now; we need to update the labels here.

@smarterclayton
Contributor

As discussed in chat, the fact that we wait 5m to detect a deleted instance from the cloud provider is a bug. It was probably done to rate-limit queries to AWS, but we should try to reduce the detection time of that specific condition to reduce this pain point.

@enxebre
Member

enxebre commented Mar 4, 2019

@smarterclayton @derekwaynecarr since we drain nodes on deletion, they go unschedulable right away, so the scheduler does not take them into consideration while they wait to be garbage collected.

Also, I'm thinking of getting the NodeRef when deleting a machine and deleting the backing node right away from the machine controller after the actuator succeeds in deleting the cloud instance, so there is no need to wait for the cloud node lifecycle controller to garbage-collect the orphan node. Does this sound reasonable?
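The ordering proposed here — delete the cloud instance via the actuator first, then delete the backing Node object only if that succeeds — can be sketched with hypothetical interfaces (`actuator`, `nodeClient`, and `deleteMachine` are illustrative names, not the real machine-controller API):

```go
package main

import "fmt"

// actuator stands in for the cloud-provider actuator that deletes instances.
type actuator interface {
	DeleteInstance(machine string) error
}

// nodeClient stands in for the API client that deletes Node objects.
type nodeClient interface {
	DeleteNode(name string) error
}

// deleteMachine deletes the cloud instance first and, only on success,
// deletes the backing Node right away so the cluster need not wait for the
// node lifecycle controller to garbage-collect the orphan node.
func deleteMachine(a actuator, nodes nodeClient, machine, nodeRef string) error {
	if err := a.DeleteInstance(machine); err != nil {
		return fmt.Errorf("delete instance: %w", err)
	}
	if err := nodes.DeleteNode(nodeRef); err != nil {
		return fmt.Errorf("delete node: %w", err)
	}
	return nil
}

// Fakes that record the call order, for demonstration only.
type fakeCloud struct{ calls *[]string }

func (f fakeCloud) DeleteInstance(m string) error {
	*f.calls = append(*f.calls, "instance:"+m)
	return nil
}

type fakeNodes struct{ calls *[]string }

func (f fakeNodes) DeleteNode(n string) error {
	*f.calls = append(*f.calls, "node:"+n)
	return nil
}

func main() {
	var calls []string
	if err := deleteMachine(fakeCloud{&calls}, fakeNodes{&calls}, "worker-0", "node-0"); err != nil {
		panic(err)
	}
	fmt.Println(calls) // [instance:worker-0 node:node-0]
}
```

The point of the ordering is that a failed instance deletion leaves the Node in place, so nothing is orphaned prematurely.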

@derekwaynecarr
Member Author

@enxebre since you know the machine was actually deleted, I do not see a major risk doing it there, but I do think we need to look at the node lifecycle controller, as it should look at all nodes every 5s given its monitor period.

Either way, this PR is ready to go.

@derekwaynecarr
Member Author

/retest

1 similar comment
@ingvagabund
Member

/retest

return result
}

// nodeNames returns the names of nodes
Contributor

nit: should be machineNames.

)

const (
machineLabelSelectorWorker = "machine.openshift.io/cluster-api-machine-role=worker"
Member

Not a blocker: we are trying to reduce deps on OpenShift-specific annotations, so we could fetch worker machines by matching the associated node label.

@enxebre
Member

enxebre commented Mar 29, 2019

To clarify: the machine controller now gets rid of the node object after removing the instance. Also, node-monitor-grace-period is currently set to 5min: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/bindata/v3.11.0/kube-controller-manager/defaultconfig.yaml#L29

@enxebre
Member

enxebre commented Mar 29, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 29, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre enxebre dismissed spangenberg’s stale review March 29, 2019 15:59

this is updated now. Daniel can't re-approve as he's off

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit be1b971 into openshift:master Mar 29, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.