Fix bug with Watch and 410 retries #227
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @chrisayoub!
@chrisayoub: GitHub didn't allow me to assign the following users: mbohlool. Note that only kubernetes-client members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
/assign @micw523
if timeout or stop, retry should not be attempted? i think the current behavior is expected.
@yliaog If you look at the check at the end of the loop, a 410 is never retried once a timeout is set. So for clients that specify a timeout as an argument to the function, the result is that when a 410 error occurs, the caller gets no exception at all and it appears that the watch has gracefully ended, which makes it impossible for a caller to recover from 410.
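To make the symptom concrete, here is a hypothetical caller-side sketch (the client setup, resource names, and resource version are illustrative and not taken from the PR); with a timeout set, a 410 inside the stream simply ends the generator instead of raising ApiException:

```python
# Hypothetical caller illustrating the silent failure mode described above.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()
w = watch.Watch()

# A stale resource_version will trigger a 410 Gone from the API server.
for event in w.stream(v1.list_namespaced_pod, namespace="default",
                      timeout_seconds=30, resource_version="100"):
    print(event["type"], event["object"].metadata.name)

# With the buggy behavior, the loop just ends here with no ApiException,
# so the caller cannot tell that the watch died because of a 410.
```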
i see. in that case, the timeouts condition is not useful. but it should still break without retry when stop is true. so what about the following:

    if self._stop:
        break
This is the previous PR and issue where this retry behavior was introduced (#133). Do you think that change would conflict with this?
Yeah, removing the timeouts condition would break functionality: if that were set, a user might end up having a watch which lasts forever despite having set a timeout on it explicitly. Thinking about this more, I actually think the best fix would be to never do any type of retry when timeouts is set. Additionally, I think maybe we should make that intention more clear so that similar bugs don't get introduced in the future. Before the while loop, we could set a variable like:

    disable_retries = timeouts is not None

and then alter this condition to be:

    if self._stop or disable_retries:
        break

and change the condition on line 169 to:

    if not disable_retries and not retry_after_410 and obj['code'] == HTTP_STATUS_GONE:
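For readers without the diff open, here is a rough sketch of how those three pieces could fit together inside Watch.stream(); the surrounding loop structure is paraphrased from watch.py and abbreviated, so treat it as an illustration of the suggestion rather than the exact patch:

```python
# Sketch only (not the exact patch): the loop body is trimmed, and names such as
# func, kwargs, timeouts, and return_type follow the existing stream() code discussed above.
disable_retries = timeouts is not None  # suggested: never retry if the caller set a timeout
retry_after_410 = False

while True:
    resp = func(*args, **kwargs)
    try:
        for line in iter_resp_lines(resp):
            event = self.unmarshal_event(line, return_type)
            if isinstance(event, dict) and event['type'] == 'ERROR':
                obj = event['raw_object']
                # Current request expired: retry once, but only if retries
                # are enabled and we have not already retried.
                if not disable_retries and not retry_after_410 and \
                        obj['code'] == HTTP_STATUS_GONE:
                    retry_after_410 = True
                    break
                else:
                    reason = "%s: %s" % (obj['reason'], obj['message'])
                    raise client.rest.ApiException(status=obj['code'], reason=reason)
            else:
                retry_after_410 = False
                yield event
            if self._stop:
                break
    finally:
        resp.close()
        resp.release_conn()

    # suggested: exit instead of retrying when stopped or when a timeout was given
    if self._stop or disable_retries:
        break
```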
@PaulFurtado I updated the PR with your suggested changes; I tested them and they seem to work correctly for me. @yliaog can you take another look? Thanks!
i'm not sure how the PR solves the original problem as given in the PR description below. when there is a 410 error and a timeout is given, there will be no retry at all; it just returns. "When there is a 410 error that needs to be retried and the user specifies any timeout values (timeouts), the code currently returns, when it really should proceed to the next iteration of the for loop and actually do the retry."
I think this is just a question of the desired behavior, ultimately. When the user specifies a timeout value and we encounter an event of type ERROR with a 410 code, the stream currently just ends. Ultimately, the desired fix is that under these circumstances we do not simply return from the function, and we either yield the event or raise an Exception so the caller can react.
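Assuming the fix raises ApiException for the 410 case, a caller could then recover along these lines (hypothetical caller code, not part of this PR; function and resource names are illustrative):

```python
from kubernetes import client, config, watch

def watch_pods(namespace="default"):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    w = watch.Watch()
    resource_version = None
    while True:
        kwargs = {"namespace": namespace, "timeout_seconds": 30}
        if resource_version:
            kwargs["resource_version"] = resource_version
        try:
            for event in w.stream(v1.list_namespaced_pod, **kwargs):
                resource_version = event["object"].metadata.resource_version
                print(event["type"], event["object"].metadata.name)
        except client.rest.ApiException as e:
            if e.status == 410:
                # Our resource_version is too old: drop it and re-list from scratch.
                resource_version = None
                continue
            raise
```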
@yliaog by adding the disable_retries check, a 410 with a timeout set now raises an exception back to the caller instead of silently returning, so the caller can decide how to recover.
@@ -166,7 +166,7 @@ def stream(self, func, *args, **kwargs):
                        obj = event['raw_object']
                        # Current request expired, let's retry,
                        # but only if we have not already retried.
                        if not retry_after_410 and \
please add a comment about the added condition "not disable_retries and not"
Done, see lines 154-155 as well
looks good to disable retry for the 410 case, could you please add a test case for it?
# No events are generated when no initial resourceVersion is passed
# No retry is attempted either, preventing an ApiException
assert not list(w.stream(fake_api.get_thing))
I updated the existing test case, which was not actually correct for the current version of the code with retries. Additionally, I added 2 more test cases to test both code paths that my PR is related to.
w = Watch()
try:
    for _ in w.stream(fake_api.get_thing, timeout_seconds=10):
no resource_version?
For testing this code path and condition, you do not actually need to supply resource_version here, as an exception is raised and only the code in the watch finally block will execute. resource_version is needed in the other test because the code path goes beyond the finally block.
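For reference, here is a sketch of what this test could look like end to end, following the mock style of the existing watch tests; the test name, fixture values, docstring, and import path are assumptions rather than an excerpt from the PR:

```python
# Sketch of the timeout + 410 test, with mock wiring assumed to mirror
# the existing watch test fixtures.
import json
from unittest.mock import Mock

from kubernetes import client
from kubernetes.watch import Watch  # import path differs slightly in python-base

def test_watch_with_error_event_and_timeout():
    error_event = {"type": "ERROR",
                   "object": {"code": 410,
                              "reason": "Gone",
                              "message": "too old resource version"}}
    fake_resp = Mock()
    fake_resp.close = Mock()
    fake_resp.release_conn = Mock()
    fake_resp.read_chunked = Mock(return_value=[json.dumps(error_event) + "\n"])

    fake_api = Mock()
    fake_api.get_thing = Mock(return_value=fake_resp)
    fake_api.get_thing.__doc__ = ":return: VersionedThingList"  # assumed docstring

    w = Watch()
    raised = False
    try:
        for _ in w.stream(fake_api.get_thing, timeout_seconds=10):
            pass
    except client.rest.ApiException as e:
        raised = True
        assert e.status == 410
    assert raised, "expected ApiException because retries are disabled with a timeout"
```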
ok
thanks for the pr. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: chrisayoub, yliaog. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@yliaog it looks like the failed CI run is preventing this from automatically merging. However, this test case isn't even in this repo? Am I supposed to do anything else to proceed further in getting this merged? Thank you!
closing the pr, and reopening to trigger another CI run
/close
@yliaog: Closed this PR. In response to this: /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/open
/reopen
@yliaog: Reopened this PR. In response to this: /reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There is a bug that was introduced with the following PR: #133.
When there is a 410 error that needs to be retried and the user specifies any timeout values (timeouts), the code currently returns, when it really should proceed to the next iteration of the for loop and actually do the retry.