try to provide most recent copy of AW to enable fast deletion #558
Conversation
What do you mean by best effort?
Meaning: if an AW is submitted and ScheduleNext picks it up, it goes through all the dispatch logic; when the time comes for the API server update, that update fails, which causes the AW to be requeued and then deleted later. So there is a delay of approximately one dispatch cycle to delete the AW.
The issue tagged in the PR explains why this PR is needed. I submitted 1K AWs, deleted them with oc delete, and later resubmitted the same 1K AWs. It now takes more than 2 minutes to schedule the first AW from the resubmitted 1K. I think this is an improvement over the delays seen in the issue.
It needs some more testing for quick deletions at scale. I reduced the retries and also chose not to retry when there is a conflict error from etcd; the bias is to trust etcd, especially in the case of deletions.
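For illustration, a classifier along these lines could express that bias with the eapache/go-resiliency retrier used in the diff below. This is only a sketch, not the repository's actual EtcdErrorClassifier, and the package name is assumed.

package controller

import (
	"github.com/eapache/go-resiliency/retrier"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// conflictAwareClassifier illustrates the "trust etcd on deletions" bias:
// conflict and not-found errors from the API server are treated as terminal,
// because the AW has very likely been changed or deleted underneath us;
// everything else is retried.
type conflictAwareClassifier struct{}

func (conflictAwareClassifier) Classify(err error) retrier.Action {
	switch {
	case err == nil:
		return retrier.Succeed
	case apierrors.IsConflict(err), apierrors.IsNotFound(err):
		return retrier.Fail // give up immediately instead of retrying
	default:
		return retrier.Retry
	}
}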
@@ -932,7 +932,7 @@ func (qjm *XController) ScheduleNext() {
 	// the appwrapper from being added in syncjob
 	defer qjm.schedulingAWAtomicSet(nil)

-	scheduleNextRetrier := retrier.New(retrier.ExponentialBackoff(10, 100*time.Millisecond), &EtcdErrorClassifier{})
+	scheduleNextRetrier := retrier.New(retrier.ExponentialBackoff(1, 100*time.Millisecond), &EtcdErrorClassifier{})
If you are doing this, I think you are better off removing the retry logic from here and using it only when updating etcd.
For now, adding retry at the relevant spot is TBD.
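For what it's worth, a minimal sketch of retrying only the API-server write, reusing the illustrative classifier above; updateAWWithRetry and performUpdate are placeholders, not helpers that exist in the repository.

package controller

import (
	"time"

	"github.com/eapache/go-resiliency/retrier"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// updateAWWithRetry retries only the etcd/API-server write rather than the
// whole ScheduleNext body. performUpdate stands in for the call that writes
// the AppWrapper back (for example an UpdateStatus against a fresh copy).
func updateAWWithRetry(performUpdate func() error) error {
	r := retrier.New(retrier.ExponentialBackoff(3, 100*time.Millisecond), conflictAwareClassifier{})
	err := r.Run(performUpdate)
	if apierrors.IsNotFound(err) {
		// The AW was deleted by an external client mid-dispatch; there is
		// nothing left to update, so do not requeue it.
		return nil
	}
	return err
}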
I think that the changes to the ScheduleNext function need to be a little deeper. Reducing the number of retries is not the right approach. IMHO, for this PR to pass, the change will need to be reverted or the retry needs to be moved to the individual etcd updates.
I think you need to look more closely at the interactions between the worker/ScheduleNext/UpdateQueueJob threads. I suspect that UpdateQueueJob is interfering with the others. I strongly recommend re-evaluating its purpose (IMHO it should be removed).
Reducing or removing the retry seems the correct approach when the ScheduleNext function is processing an AW and an external client deletes the same AW from etcd.
Thanks, I think the interaction between the two threads is an orthogonal issue. It is easy to turn off the UpdateQueueJob thread, but we would lose the ability to update the state of the AW once it is running.
We should probably consider a more extensive revision of the handling of deletion. IMHO we should add a finalizer to the AppWrapper object on first encounter (and make sure the update went through). This finalizer then makes it possible to use the built-in deletion timestamp from Kubernetes as the trigger for deletion, not the object removal, the latter being unreliable. Updating the AppWrapper at deletion time, say to set the status to Terminating, then becomes entirely optional.
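An illustrative sketch of that flow using only the generic metav1.Object accessors; the finalizer name, updateObject, and cleanupResources are placeholders rather than MCAD's actual API.

package controller

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// awFinalizer is a hypothetical finalizer name; the real one would live next
// to the AppWrapper API types.
const awFinalizer = "workload.example.com/appwrapper-cleanup"

// Placeholders for the controller's own persistence and teardown logic.
var (
	updateObject     func(obj metav1.Object) error // write the AW back to the API server
	cleanupResources func(obj metav1.Object) error // tear down everything the AW dispatched
)

// reconcileLifecycle adds the finalizer on first encounter and treats a
// non-zero DeletionTimestamp, not the object's disappearance from the cache,
// as the trigger for deletion.
func reconcileLifecycle(aw metav1.Object) error {
	if aw.GetDeletionTimestamp().IsZero() {
		for _, f := range aw.GetFinalizers() {
			if f == awFinalizer {
				return nil // finalizer already present; nothing to do
			}
		}
		aw.SetFinalizers(append(aw.GetFinalizers(), awFinalizer))
		return updateObject(aw) // must be persisted before dispatching anything
	}

	// Deletion has been requested: clean up, then release the finalizer so
	// Kubernetes can complete the delete.
	if err := cleanupResources(aw); err != nil {
		return err
	}
	kept := []string{}
	for _, f := range aw.GetFinalizers() {
		if f != awFinalizer {
			kept = append(kept, f)
		}
	}
	aw.SetFinalizers(kept)
	return updateObject(aw)
}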
/lgtm +1 for an overhaul of the deletion logic to stick to idiomatic Kubernetes, as suggested by @tardieu. This will also eventually be reworked when we migrate over to controller-runtime.
exp_upd.log |
Opened issue #564, but the question about the discrepancy with the build passing locally still remains open. For now I have rerun the build.
LGTM
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: metalcycling
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Issue link
#477
What changes have been made
In the core AW update method, the desire is to provide, on a best-effort basis, the most recent copy of the AW so that changes such as deletion of the AW are reflected.
Verification steps
Checks