
Fix issue where cached entry size keeps increasing when making tiny query repeatedly #3968

Merged

Conversation

alvinlin123
Contributor

@alvinlin123 alvinlin123 commented Mar 18, 2021

What this PR does:
See my comment at #3920 (comment)

Which issue(s) this PR fixes:
This issue was found as part of #3920, not sure if this PR fixes the issue though.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@alvinlin123 alvinlin123 force-pushed the fix-cache-entry-size-increasing branch from b66e70f to 0cbc752 Compare March 18, 2021 08:49
Contributor

@pracucci pracucci left a comment


Thanks for working on it! I reviewed the code: good job spotting the issue, but I'm not sure this is correct.

The problem is that the extents returned by handleHit() overwrite the cached response. Now think about running a query over "last 24h", then running the same query over "last 1h". With the changes proposed in this PR, the second query on "last 1h" would overwrite the cached extents for the "previous 23h", while we don't want to lose them.
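The overwrite concern above can be illustrated with a minimal sketch (the `Extent` type and values here are simplified stand-ins, not the actual Cortex types):

```go
package main

import "fmt"

// Extent is a simplified stand-in for a cached time range [Start, End),
// in milliseconds since epoch.
type Extent struct{ Start, End int64 }

func main() {
	// Cached extents after a "last 24h" query: one extent covering 24h.
	cached := []Extent{{Start: 0, End: 24 * 3600 * 1000}}

	// A "last 1h" query only produces extents for the final hour...
	hit := []Extent{{Start: 23 * 3600 * 1000, End: 24 * 3600 * 1000}}

	// ...so blindly replacing the cached extents with the hit's extents
	// silently drops the first 23h of cached data.
	cached = hit
	fmt.Println(cached)
}
```

After the overwrite, only the last hour remains cached, which is exactly the data loss described above.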

@alvinlin123
Contributor Author

alvinlin123 commented Mar 19, 2021

Thanks for working on it! I reviewed the code: good job spotting the issue, but I'm not sure this is correct.

The problem is that the extents returned by handleHit() overwrite the cached response. Now think about running a query over "last 24h", then running the same query over "last 1h". With the changes proposed in this PR, the second query on "last 1h" would overwrite the cached extents for the "previous 23h", while we don't want to lose them.

Spot on. I have another idea; I will submit an update soon.

@alvinlin123 alvinlin123 force-pushed the fix-cache-entry-size-increasing branch from a34a6c7 to eb9cf54 Compare March 20, 2021 05:48
@alvinlin123
Contributor Author

alvinlin123 commented Mar 20, 2021

I decided to go with a different approach to fix the issue; instead of modifying the []Extent in partition, I made queryrange.handleHit able to handle overlapping Extents and queryrange.matrixMerge() able to deal with overlapping sample points. This approach is probably more resilient to future changes in partition, and it reduces the amount of slice allocation.

One thing I am not quite certain about is the following code in queryrange.MergeResponses:

for _, res := range responses {
	promResponses = append(promResponses, res.(*PrometheusResponse))
	resultsCacheGenNumberHeaderValues = append(resultsCacheGenNumberHeaderValues, getHeaderValuesWithName(res, ResultsCacheGenNumberHeaderName)...)
}

Is there any special handling needed if the given responses contain overlapping sample points?
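For readers following along, the "deal with overlapping sample points" idea can be sketched roughly as a merge of two timestamp-sorted slices that keeps a single sample per timestamp. This is a simplified illustration (the `Sample` type and `mergeSamples` helper are hypothetical; the real code works on cortexpb.Sample inside matrixMerge):

```go
package main

import "fmt"

// Sample is a simplified stand-in for cortexpb.Sample.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// mergeSamples merges two timestamp-sorted slices, keeping exactly one
// sample per timestamp (the sample from `a` wins on overlap).
func mergeSamples(a, b []Sample) []Sample {
	out := make([]Sample, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i].TimestampMs < b[j].TimestampMs:
			out = append(out, a[i])
			i++
		case a[i].TimestampMs > b[j].TimestampMs:
			out = append(out, b[j])
			j++
		default: // overlapping timestamp: keep only one copy
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	out = append(out, b[j:]...)
	return out
}

func main() {
	a := []Sample{{1000, 1}, {2000, 2}}
	b := []Sample{{2000, 2}, {3000, 3}}
	fmt.Println(mergeSamples(a, b)) // [{1000 1} {2000 2} {3000 3}]
}
```

Without the deduplicating merge, the overlapping sample at timestamp 2000 would appear twice in the merged result, which is the kind of duplicate growth this PR is fixing.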

Signed-off-by: Alvin Lin <[email protected]>
@pstibrany pstibrany requested a review from gouthamve March 25, 2021 09:31
Contributor

@pracucci pracucci left a comment


Thanks for iterating on my feedback! Your change looks good to me. I left a few nits I would be glad if you could take a look at. In the meantime I will ping another query-frontend expert for a 2nd look, because this stuff is tricky :)

func chopOffOverlapPortion(samples []cortexpb.Sample, choppingPointTs int64) []cortexpb.Sample {

// assuming stream.Samples is sorted by timestamp in ascending order.
searchResult := sort.Search(len(samples), func(i int) bool {

Should never happen with the current logic but, to be on the safe side, I would handle the case where sort.Search() returns -1 (in this case I believe we should return the input samples unchanged).

Contributor Author

@alvinlin123 alvinlin123 Mar 25, 2021


Agreed, better to be more defensive. But I took a look at the docs (https://golang.org/pkg/sort/#Search): sort.Search() never returns -1; it returns a value in [0, n], where a return value of n means not found.

func main() {
	x := []int{3, 5, 6, 7}

	// both print "4"
	fmt.Println(sort.Search(len(x), func(i int) bool { return x[i] > 100 }))
	fmt.Println(sort.Search(len(x), func(i int) bool { return x[i] < 0 }))
}

https://play.golang.org/p/HBtNWh3_MsR

But I think I get your point; I will add special handling for the case where minTs is smaller than samples[0].
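Putting the two observations together, a defensive version of the chopping helper might look like the sketch below. This is a hypothetical illustration, not the PR's actual implementation: `Sample` stands in for cortexpb.Sample, and the key point is that sort.Search's return value is always in [0, len(samples)], so slicing with it is safe in every case.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a simplified stand-in for cortexpb.Sample.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// chopOffOverlapPortion drops samples at or after choppingPointTs,
// assuming samples is sorted by timestamp in ascending order.
// sort.Search returns an index in [0, len(samples)]; len(samples)
// means "nothing to chop", and 0 means "chop everything", so no
// extra bounds checks are needed before slicing.
func chopOffOverlapPortion(samples []Sample, choppingPointTs int64) []Sample {
	idx := sort.Search(len(samples), func(i int) bool {
		return samples[i].TimestampMs >= choppingPointTs
	})
	return samples[:idx]
}

func main() {
	s := []Sample{{1000, 1}, {2000, 2}, {3000, 3}}
	fmt.Println(chopOffOverlapPortion(s, 2000)) // [{1000 1}]
	fmt.Println(chopOffOverlapPortion(s, 500))  // []
}
```

Note the second call: when the chopping point precedes every sample, sort.Search returns 0 and the result is an empty slice rather than a panic, which covers the "minTs smaller than samples[0]" case mentioned above.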

Contributor

@owen-d owen-d left a comment


Nice investigation and good job on the PR. This code is very tricky (it took me a while to reacquaint myself with it). The fix looks good to me once you address @pracucci's comments. I'm looking forward to revendoring this into Loki :)

@alvinlin123 alvinlin123 force-pushed the fix-cache-entry-size-increasing branch from fb51b56 to 4f3cb5e Compare March 25, 2021 22:51
@alvinlin123
Contributor Author

Thanks for the review and the comments @pracucci and @owen-d. Yeah, this is definitely tricky stuff; I took a trip down memory lane back to my college algorithms classes (10+ years ago) trying to understand it. Cool stuff :-)

I addressed the PR comments; let me know if they look okay.

Fix duplicate CHANGELOG.md entry.

Signed-off-by: Alvin Lin <[email protected]>
@alvinlin123 alvinlin123 force-pushed the fix-cache-entry-size-increasing branch from eda1d61 to 68890c4 Compare March 25, 2021 23:04
Contributor

@pracucci pracucci left a comment


LGTM, thanks!

Contributor

@pstibrany pstibrany left a comment


🙈

@pracucci pracucci merged commit 3588c55 into cortexproject:master Mar 29, 2021