Skip to content

QueryFrontend sets incorrect status code on HTTP timeout #2483

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 20, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@
* [BUGFIX] Ring: Fixed a situation where upgrading from pre-1.0 cortex with a rolling strategy caused new 1.0 ingesters to lose their zone value in the ring until manually forced to re-register. #2404
* [BUGFIX] Distributor: `/all_user_stats` now show API and Rule Ingest Rate correctly. #2457
* [BUGFIX] Fixed `version`, `revision` and `branch` labels exported by the `cortex_build_info` metric. #2468
* [BUGFIX] QueryFrontend: fixed a situation where HTTP error is ignored and an incorrect status code is set. #2483

## 1.0.0 / 2020-04-02

Expand Down
6 changes: 5 additions & 1 deletion pkg/querier/frontend/frontend.go
Original file line number Diff line number Diff line change
Expand Up @@ -176,7 +176,11 @@ func (f *Frontend) handle(w http.ResponseWriter, r *http.Request) {
hs[h] = vs
}
w.WriteHeader(resp.StatusCode)
io.Copy(w, resp.Body)

if _, err = io.Copy(w, resp.Body); err != nil {
server.WriteError(w, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't understand how this could could work. It will write the HTTP header but it has already been written above. Am I missing anything @dmitsh?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

server.WriteError calls w.WriteHeader as well https://github.com/weaveworks/common/blob/8f725fc52d8defd2b0f7eecab55a86f33421c08f/httpgrpc/server/server.go#L163-L175 which I thought overrides the previous header set.

But I was wrong:

	// WriteHeader sends an HTTP response header with the provided
	// status code.
	//
	// If WriteHeader is not called explicitly, the first call to Write
	// will trigger an implicit WriteHeader(http.StatusOK).
	// Thus explicit calls to WriteHeader are mainly used to
	// send error codes.
	//
	// The provided code must be a valid HTTP 1xx-5xx status code.
	// Only one header may be written. Go does not currently
	// support sending user-defined 1xx informational headers,
	// with the exception of 100-continue response header that the
	// Server sends automatically when the Request.Body is read.

	WriteHeader(statusCode int)

We can only call WriteHeader once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a chicken and egg situation. We have to set the status before calling Write, but the Write can fail.
With this change, server.WriteError will override the status, and we are logging the correct information:

level=warn ts=2020-04-20T14:37:20.70899Z caller=logging.go:49 traceID=29b2da71f9d67c9b msg="GET /api/v1/query_range?&query=custom_metric_2000&start=2020-04-20T06%3A00%3A00-07%3A00&end=2020-04-20T06%3A30%3A00-07%3A00&step=10 (500) 32.354122946s Response: \"write tcp [::1]:9090->[::1]:56587: i/o timeout\\n\" ws: false; Accept: */*; User-Agent: curl/7.64.1; "

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify:
Before this PR, if io.Copy() has failed, cortex was returning an empty response (probably because the write was aborted in the middle). But the log was indicating successful status code (200).
With my PR, cortex still returns an empty response if io.Copy() fails (for the same reason), but we have the correct status code (500) and the error message in the log and metrics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, just went through the code again. The problem is io.Copy will fail if w.Write fails. And we shouldn't try to attempt a w.Write again (which is what server.WriteError(w, err) does internally). I am not sure why the log is printing 200, I would say this is a logging middleware issue and this is not the right fix. But I haven't looked at the logging middleware yet, will do now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/weaveworks/common/blob/master/middleware/logging.go#L40-L42

This is only because we had a bad implementation of the logging middleware. https://github.com/weaveworks/common/blob/master/middleware/response.go#L42-L54 <-- We should ideally capture the error here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine the most likely scenario for Write() to fail is that the client has disconnected.
In that case the context should be cancelled; I tried to detect and log those in weaveworks/common#180 but perhaps ineffectively.

Note it is somewhat unwarranted to count a user action as a Cortex failure - if you're trying to apply an SLI you don't want them lumped in together. We could do like Nginx, and count them as code 499.

Copy link
Contributor Author

@dmitsh dmitsh Apr 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An easy way to reproduce this problem is to set http_server_write_timeout to a short duration (1 sec?), and then invoke HTTP request. If it takes more than 1 second to get the data, the request will time out, and io.Copy() will fail.

return
}
}

func writeError(w http.ResponseWriter, err error) {
Expand Down