
e2e: delete worker machines and ensure recovery #22090


Merged

Conversation

derekwaynecarr
Member

Add disruptive testing to ensure that if we delete worker machines, we recover with new machines and ready nodes within 5 minutes.

/cc @enxebre

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 20, 2019
@derekwaynecarr
Member Author

I want to ensure we get a graceful termination signal for an existing pod that was scheduled to a worker.

Probably by verifying a terminationLogMessage or some equivalent.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 20, 2019
@derekwaynecarr
Member Author

Actually, we can handle the termination log message in a follow-on.
I'd prefer to get more testing in at this point.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 20, 2019
@derekwaynecarr
Member Author

@enxebre is there a PR I missed that handles the usage of the sigs.k8s.io labels?

@derekwaynecarr derekwaynecarr changed the title Disruptive worker tests delete machines and ensure recovery e2e: Disruptive worker tests delete machines and ensure recovery Feb 20, 2019
@derekwaynecarr derekwaynecarr changed the title e2e: Disruptive worker tests delete machines and ensure recovery e2e: delete worker machines and ensure recovery Feb 20, 2019

@enxebre
Member

enxebre commented Feb 20, 2019

lgtm cc @spangenberg for coordinating with labels openshift/cluster-api-provider-aws#161 (comment)
and @ingvagabund for draining

@derekwaynecarr
Member Author

/test e2e-aws
/test e2e-aws-serial

@enxebre
Member

enxebre commented Feb 25, 2019

failed to apply terraform:

level=error msg="Error: Error applying plan:"
level=error
level=error msg="1 error occurred:"
level=error msg="\t* aws_route53_zone.int: 1 error occurred:"
level=error msg="\t* aws_route53_zone.int: error waiting for Route53 Hosted Zone (Z3ALYAV3UFHIQD) creation: timeout while waiting for state to become 'INSYNC' (last state: 'PENDING', timeout: 15m0s)"

/retest


@spangenberg spangenberg left a comment


Since openshift/cluster-api-provider-aws#161 got merged, we should change the label now. @derekwaynecarr could you apply the change?


const (
// TODO: cloud team may change this when changing label keys
machineLabelSelectorWorker = "sigs.k8s.io/cluster-api-machine-role=worker"


Suggested change
- machineLabelSelectorWorker = "sigs.k8s.io/cluster-api-machine-role=worker"
+ machineLabelSelectorWorker = "machine.openshift.io/cluster-api-machine-role=worker"
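For illustration, matching a machine's labels against a selector like the one above can be sketched in plain Go (`matchesSelector` is a simplified, hypothetical helper handling only a single `key=value` term; the real code uses Kubernetes label selectors from apimachinery):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesSelector reports whether the given labels satisfy a single
// "key=value" selector such as
// "machine.openshift.io/cluster-api-machine-role=worker".
func matchesSelector(selector string, labels map[string]string) bool {
	parts := strings.SplitN(selector, "=", 2)
	if len(parts) != 2 {
		return false // malformed selector: no "=" present
	}
	return labels[parts[0]] == parts[1]
}

func main() {
	sel := "machine.openshift.io/cluster-api-machine-role=worker"
	worker := map[string]string{"machine.openshift.io/cluster-api-machine-role": "worker"}
	master := map[string]string{"machine.openshift.io/cluster-api-machine-role": "master"}
	fmt.Println(matchesSelector(sel, worker), matchesSelector(sel, master)) // true false
}
```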

Member

@spangenberg This will never pass the test with that label unless openshift/installer#1263 gets merged, which is waiting to be rebased

@derekwaynecarr
Member Author

I can hold this PR until the requisite PRs land.

@derekwaynecarr
Member Author

Cluster storage operator failed to report:

level=info msg="Waiting up to 30m0s for the cluster to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator cluster-storage-operator has not yet reported success"

/test e2e-aws

@enxebre
Member

enxebre commented Feb 28, 2019

@derekwaynecarr
openshift/installer#1263 got merged now; we need to update the labels here.

@smarterclayton
Contributor

As discussed in chat, the fact that we wait 5m to detect a deleted instance from the cloud provider is a bug. It was probably done to rate-limit queries to AWS, but we should try to reduce the detection time of that specific condition to reduce this pain point.

@enxebre
Member

enxebre commented Mar 4, 2019

@smarterclayton @derekwaynecarr since we drain nodes on deletion, they go unschedulable right away, so the scheduler does not take them into consideration while they wait to be garbage collected.

Also, I'm thinking of getting the NodeRef when deleting a machine and deleting the backing node right away from the machine controller after the actuator succeeds in deleting the cloud instance, so there is no need to wait for the cloud node lifecycle controller to garbage-collect the orphan node. Does this sound reasonable?
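The ordering proposed here — delete the cloud instance via the actuator first, then delete the backing Node object only if that succeeds — can be sketched with hypothetical interfaces (`actuator`, `nodeClient`, and `deleteMachine` are illustrative names, not the real machine-controller API):

```go
package main

import "fmt"

// actuator stands in for the cloud-provider actuator that deletes instances.
type actuator interface {
	DeleteInstance(machine string) error
}

// nodeClient stands in for the API client that deletes Node objects.
type nodeClient interface {
	DeleteNode(name string) error
}

// deleteMachine deletes the cloud instance first and, only on success,
// deletes the backing Node right away so the cluster need not wait for the
// node lifecycle controller to garbage-collect the orphan node.
func deleteMachine(a actuator, nodes nodeClient, machine, nodeRef string) error {
	if err := a.DeleteInstance(machine); err != nil {
		return fmt.Errorf("delete instance: %w", err)
	}
	if err := nodes.DeleteNode(nodeRef); err != nil {
		return fmt.Errorf("delete node: %w", err)
	}
	return nil
}

// Fakes that record the call order, for demonstration only.
type fakeCloud struct{ calls *[]string }

func (f fakeCloud) DeleteInstance(m string) error {
	*f.calls = append(*f.calls, "instance:"+m)
	return nil
}

type fakeNodes struct{ calls *[]string }

func (f fakeNodes) DeleteNode(n string) error {
	*f.calls = append(*f.calls, "node:"+n)
	return nil
}

func main() {
	var calls []string
	if err := deleteMachine(fakeCloud{&calls}, fakeNodes{&calls}, "worker-0", "node-0"); err != nil {
		panic(err)
	}
	fmt.Println(calls) // [instance:worker-0 node:node-0]
}
```

The point of the ordering is that a failed instance deletion leaves the Node in place, so nothing is orphaned prematurely.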

@derekwaynecarr
Member Author

@enxebre since you know the machine was actually deleted, I do not see a major risk doing it there, but I do think we need to look at the node lifecycle controller, as it should look at all nodes every 5s given its monitor period.

Either way, this PR is ready to go.

@derekwaynecarr
Member Author

/retest

1 similar comment
@ingvagabund
Member

/retest

return result
}

// nodeNames returns the names of nodes
Contributor

nit: should be machineNames.

)

const (
machineLabelSelectorWorker = "machine.openshift.io/cluster-api-machine-role=worker"
Member

Not a blocker: we are trying to reduce deps on OpenShift-specific annotations, so we could fetch worker machines by matching the associated node label.

@enxebre
Member

enxebre commented Mar 29, 2019

To clarify: the machine controller now gets rid of the node object after removing the instance. Also, node-monitor-grace-period is currently set to 5min: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/bindata/v3.11.0/kube-controller-manager/defaultconfig.yaml#L29

@enxebre
Member

enxebre commented Mar 29, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 29, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre enxebre dismissed spangenberg’s stale review March 29, 2019 15:59

this is updated now. Daniel can't re-approve as he's off

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit be1b971 into openshift:master Mar 29, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.