
Revert "Let forward requests run until timeout (#2679) + style fixes. #2771


Closed
bwplotka wants to merge 2 commits

Conversation

@bwplotka (Member) commented Jun 16, 2020

Let's revert the commit in the path of the problematic handler we saw in production.

I think PR #2679 made sense: we want to replicate 3x, with 2x strict and the 3rd best effort. While it makes sense logically, there are a few reasons why this change could saturate receive, e.g.:

  • If there is a single slow writer, we have to open new connections and eventually run out of them.
  • We still use extra bandwidth because of the requests left running in the background.
  • We put more pressure on TSDBs (more concurrent appends).

Essentially, with this change we have "leaking" rate limiting (if there is any ;p), because we claim the request has ended while things are still processing.

Anyway, let's build an image from this branch, deploy it, and see if we can reproduce the issue. There are solid reasons why this change might contribute to the potential saturation.
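
For illustration, here is a minimal Go sketch of the trade-off being discussed. It is not the actual Thanos receive handler; replicate, forward, and cancelOnQuorum are made-up names. With cancelOnQuorum set to true (the behaviour this revert restores), the remaining in-flight forwards are cancelled as soon as quorum is reached; with false (#2679), they keep running until the parent context's deadline.

```go
package sketch

import (
	"context"
	"errors"
)

// forward stands in for a single per-replica remote-write call.
type forward func(ctx context.Context) error

// replicate fans a write out to all forwards and returns once `quorum`
// of them succeed.
func replicate(ctx context.Context, quorum int, forwards []forward, cancelOnQuorum bool) error {
	fctx, cancel := ctx, context.CancelFunc(func() {})
	if cancelOnQuorum {
		// Reverted behaviour: stragglers get cancelled once we return.
		fctx, cancel = context.WithCancel(ctx)
	}
	defer cancel()

	results := make(chan error, len(forwards))
	for _, f := range forwards {
		f := f
		go func() { results <- f(fctx) }()
	}

	ok, failed := 0, 0
	for range forwards {
		select {
		case err := <-results:
			if err != nil {
				failed++
				if failed > len(forwards)-quorum {
					return errors.New("write quorum cannot be reached")
				}
				continue
			}
			ok++
			if ok >= quorum {
				// Quorum reached: report success to the client. Stragglers are
				// either cancelled via fctx (revert) or left running (#2679).
				return nil
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return errors.New("write quorum not reached")
}
```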

@bwplotka (Member, Author)

Let's NOT merge until we test this out.

@bwplotka (Member, Author)

We should have the revert-receive-cancel-2020-06-16-88f0440d image on Quay now.

@bwplotka bwplotka marked this pull request as draft June 16, 2020 06:23
@bwplotka bwplotka requested review from kakkoyun, brancz and squat June 16, 2020 06:23
@squat (Member) left a comment


I don't think this is what we want: with this change we'll again see high error rates even when 2 writes succeed and the third is still in flight. In other words, this is okay-ish for testing the theory (though we might confuse the source of errors), but it is not code we should merge.

@bwplotka (Member, Author)

I don't think error rates are the problem here - we can easily change the instrumentation to report those properly in a separate PR. It's better than not being able to handle the load (:

@squat (Member) commented Jun 16, 2020

Yes, of course we can change the instrumentation, but if we choose to go this route then I think we need to do that before merging the PR, since it will otherwise conflate real errors with expected cancellations.
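
For reference, distinguishing expected post-quorum cancellations from real failures in the metrics could look roughly like this. This is a hypothetical sketch: the metric names and the observeForward helper are made up, not existing Thanos code.

```go
package sketch

import (
	"context"
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	forwardErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "receive_forward_errors_total",
		Help: "Forward requests that failed for reasons other than cancellation.",
	})
	forwardCancels = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "receive_forward_cancellations_total",
		Help: "Forward requests cancelled after write quorum was reached.",
	})
)

// observeForward counts a forward as an error only when it failed for a
// reason other than being cancelled after quorum was already reached.
func observeForward(err error) {
	switch {
	case err == nil:
		// Success; nothing to record here.
	case errors.Is(err, context.Canceled):
		forwardCancels.Inc() // expected once quorum is met, not a real error
	default:
		forwardErrors.Inc()
	}
}
```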

@bwplotka (Member, Author) commented Jun 16, 2020

How is merging this in its current state worse than not being able to use master at all? (:

@bwplotka (Member, Author)

But yeah, agreed - something to fix ASAP. Let's deploy this first and check.

@bwplotka (Member, Author)

Looks like it was the isolation fix after all, not this. Closing.

@bwplotka closed this Jun 16, 2020