Duplicated sample points returned by query-frontend #3920
Comments
Could you clarify whether you are using the
Yes
I tried running all the queries directly after I saw the bug through
One more data point: I was able to reproduce the issue by querying against the querier and a query-frontend pod directly.
Below are the curl commands I ran; port 7777 goes to a querier, whereas port 9999 goes to a single query-frontend pod.
▶ curl --location -g --request POST "http://localhost:9999/prometheus/api/v1/query_range?query=my_metrics&start=1615161360&end=1615161720&step=60s" -H "X-Scope-OrgId: tenant_id" -d '{}' -v | jq
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"__name__": "my_metrics"
},
"values": [
[
1615161360,
"242"
],
[
1615161420,
"29"
],
[
1615161480,
"29"
],
[
1615161540,
"29"
],
[
1615161600,
"29"
],
[
1615161660,
"29"
],
[
1615161720,
"73"
],
[
1615161660,
"29"
],
[
1615161720,
"73"
]
]
}
]
}
}
▶ curl --location -g --request POST "http://localhost:7777/prometheus/api/v1/query_range?query=my_metrics&start=1615161360&end=1615161720&step=60s" -H "X-Scope-OrgId: tenant_id" -d '{}' -v | jq
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"__name__": "my_metrics"
},
"values": [
[
1615161360,
"242"
],
[
1615161420,
"29"
],
[
1615161480,
"29"
],
[
1615161540,
"29"
],
[
1615161600,
"29"
],
[
1615161660,
"29"
],
[
1615161720,
"73"
]
]
}
]
}
}
▶ kill -3 $(pgrep -f '^kubectl.* port-forward .* 7777')
▶ curl --location -g --request POST "http://localhost:7777/prometheus/api/v1/query_range?query=my_metrics&start=1615161360&end=1615161720&step=60s" -H "X-Scope-OrgId: tenant_id" -d '{}' -v | jq
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"__name__": "my_metrics"
},
"values": [
[
1615161360,
"242"
],
[
1615161420,
"29"
],
[
1615161480,
"29"
],
[
1615161540,
"29"
],
[
1615161600,
"29"
],
[
1615161660,
"29"
],
[
1615161720,
"73"
]
]
}
]
}
}
▶ kill -3 $(pgrep -f '^kubectl.* port-forward .* 7777')
▶ curl --location -g --request POST "http://localhost:7777/prometheus/api/v1/query_range?query=my_metrics&start=1615161360&end=1615161720&step=60s" -H "X-Scope-OrgId: tenant_id" -d '{}' -v | jq
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"__name__": "my_metrics"
},
"values": [
[
1615161360,
"242"
],
[
1615161420,
"29"
],
[
1615161480,
"29"
],
[
1615161540,
"29"
],
[
1615161600,
"29"
],
[
1615161660,
"29"
],
[
1615161720,
"73"
]
]
}
]
}
}
▶ curl --location -g --request POST "http://localhost:9999/prometheus/api/v1/query_range?query=my_metrics&start=1615161360&end=1615161720&step=60s" -H "X-Scope-OrgId: tenant_id" -d '{}' -v | jq
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"__name__": "my_metrics"
},
"values": [
[
1615161360,
"242"
],
[
1615161420,
"29"
],
[
1615161480,
"29"
],
[
1615161540,
"29"
],
[
1615161600,
"29"
],
[
1615161660,
"29"
],
[
1615161720,
"73"
],
[
1615161660,
"29"
],
[
1615161720,
"73"
]
]
}
]
}
}
From the Cortex community call: wondering if this would be in or around https://github.com/cortexproject/cortex/blob/master/pkg/querier/queryrange/split_by_interval.go#L43
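For reference, the splitting behaviour in question can be sketched roughly like this. This is a simplified, standalone illustration (not the actual code in split_by_interval.go, which operates on request objects rather than raw timestamps); it only shows how a range query that straddles a day boundary ends up as two sub-queries:

```go
package main

import (
	"fmt"
	"time"
)

// splitByInterval is a simplified sketch of splitting a range query at fixed
// interval boundaries (e.g. 24h). Timestamps are Unix seconds. Illustrative
// only; not the Cortex implementation.
func splitByInterval(start, end int64, interval time.Duration) [][2]int64 {
	step := int64(interval / time.Second)
	var subQueries [][2]int64
	for s := start; s < end; {
		// End of the interval that s falls into, aligned to the interval grid.
		e := (s/step + 1) * step
		if e > end {
			e = end
		}
		subQueries = append(subQueries, [2]int64{s, e})
		s = e
	}
	return subQueries
}

func main() {
	// The range used in the curl commands above straddles the GMT day boundary
	// at 1615161600 (2021-03-08T00:00:00Z), so it is split into two sub-queries.
	for _, q := range splitByInterval(1615161360, 1615161720, 24*time.Hour) {
		fmt.Printf("start=%d end=%d\n", q[0], q[1])
	}
}
```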
OK, one more data point I just discovered: if I set
I just noticed something weird. With the old commit af9e20c the size of the cache entry doesn't change, but after I upgraded to commit dd6dbf9 I started to see the cache item size grow every time I make the same query:
The range query I am making has start=1614556800 and end=1614556860, which crosses a day's boundary in GMT. With the above in mind, I was somehow able to reproduce the duplicate sample issue if I downgrade from the newer commit to the older commit; I guess that is because the cached entry contains lots of duplicate samples?
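To make the day-boundary point concrete, here is a small illustrative snippet (not part of Cortex) that converts the timestamps mentioned in this issue to UTC, showing that they sit exactly at, or right around, midnight GMT:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps from this issue, printed in UTC to show where the GMT day
	// boundary falls.
	for _, ts := range []int64{1614556800, 1614556860, 1615161360, 1615161600, 1615161720} {
		fmt.Println(ts, "=", time.Unix(ts, 0).UTC().Format(time.RFC3339))
	}
	// 1614556800 = 2021-03-01T00:00:00Z (exactly midnight GMT)
	// 1614556860 = 2021-03-01T00:01:00Z
	// 1615161360 = 2021-03-07T23:56:00Z
	// 1615161600 = 2021-03-08T00:00:00Z (the boundary the curl queries above cross)
	// 1615161720 = 2021-03-08T00:02:00Z
}
```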
We noticed another issue with the behaviour introduced by 50ab740, and Goutham fixed it in #3818. Could you test whether #3818 solves your issue too? I would just like to rule out that the issue has already been (unintentionally) fixed.
OK, so after looking closer, 50ab740 did indeed cause the issue where the cache entry size kept increasing. Consider the following: we are making a range query where
Because the above timestamps cross a day's boundary, the query splitter will split it into two queries:
Note that all the resulting queries have an interval of less than 5 minutes. So these 2 queries will get passed further down to the … Both queries will go through the … Because the queries are less than 5 minutes, this line of code will not use the cached data, and will return an …
Before 50ab740, I think it was expected that … I am working on a PR to fix this.
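As a rough sketch of the behaviour described above: the two sub-queries produced by splitting at the day boundary are both shorter than the minimum cacheable length, so they skip the cache. The names below are hypothetical and the 5 minute minimum is assumed from the discussion; this is not the actual Cortex results-cache code.

```go
package main

import (
	"fmt"
	"time"
)

// shouldCacheResult is a hypothetical stand-in for the check described above:
// sub-queries shorter than some minimum length bypass the results cache and
// are sent straight to the querier.
func shouldCacheResult(startMs, endMs int64, minCacheableLen time.Duration) bool {
	return time.Duration(endMs-startMs)*time.Millisecond >= minCacheableLen
}

func main() {
	// The two sub-queries produced by splitting the curl example above at the
	// GMT day boundary (timestamps in milliseconds, 5 minute minimum assumed).
	subQueries := [][2]int64{
		{1615161360000, 1615161600000}, // 4 minutes, before midnight GMT
		{1615161600000, 1615161720000}, // 2 minutes, after midnight GMT
	}
	for _, q := range subQueries {
		fmt.Printf("start=%d end=%d cacheable=%v\n",
			q[0], q[1], shouldCacheResult(q[0], q[1], 5*time.Minute))
	}
}
```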
What is left to fix here after #3968?
I am not seeing any more duplicate sample points now, so I will go ahead and resolve this issue, but I am still not sure how it was fixed :) Something must have changed in the de-duplication logic together with #3968 that fixes this duplicate sample issue.
Describe the bug
I have a continuous test running that queries Cortex. After upgrading from commit af9e20c to dd6dbf9, we observed that around 16:00:00 Pacific time our test would fail because duplicate sample points were returned:
Notice how the sample points with timestamps 1614556800 and 1614556860 are duplicated. Our continuous test is quite simple: it runs periodically to push some data to and query some data from Cortex. Our test is not using HA mode.
To Reproduce
I am not sure how to reproduce it, but one pattern is that it seems to happen every day around 16:00:00 Pacific time (00:00:00 GMT). The queries I had issues with are queries over past data points, like:
Also, if I keep making the same query multiple times, the issue goes away after a few tries for that specific query; the problem comes back the next day at 16:00:00 Pacific time.
Expected behavior
I don't expect sample points to be duplicated because:
Environment:
Storage Engine