Investigating the reason out of order samples are appearing in LTS blocks #4724


Closed
shybbko opened this issue Apr 25, 2022 · 2 comments

shybbko commented Apr 25, 2022

I'm running two config-identical GKE Cortex clusters (Cortex 1.11, GKE 1.21). They aggregate metrics from several sources; you could say we're talking a nonprod and a prod cluster. Each cluster holds several TB of metrics.
In the nonprod cluster I have 0 blocks with out-of-order series.
In the prod cluster, over 20.

I only noticed this when those blocks were skipped from compaction because they contain OoO samples.

So far the out-of-order blocks are "grouped" around seven timestamps between January and March. The clusters have been online for much longer, but between December and January I was migrating from chunks to blocks (not sure whether that's relevant).
The faulty blocks either cover an "even" range (e.g. 4 to 6 pm) or not (e.g. 4:01:13 to 6 pm).
So far I don't recall any particular events taking place around those timestamps.
Each faulty block contains between 1 and 13 out-of-order series.
I also don't see any obvious correlation between the OoO samples (job, source node, etc.); they appear random.

I wanted to investigate the reasons for:

  • the blocks with OoO series appearing at all
  • the blocks appearing in the "prod" cluster only

So far I have come up with three possible explanations, but I can neither confirm nor deny any of them being the cause:

Any ideas, hints, suggestions? Ideally I'd like to ensure that not a single new block containing out-of-order series appears in the future.
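
Aside: for listing the offending series in a suspect block, here is a rough sketch (not Cortex code). It assumes `promtool tsdb dump` prints one sample per line as `{labels} value timestamp_ms`, which may differ between promtool versions, and the file name `ooo_scan.go` is just an example. The program reads the dump on stdin and reports every series whose timestamps go backwards.

```go
// ooo_scan.go: scan `promtool tsdb dump` output for series whose timestamps
// go backwards. Assumes each line looks like:
//   {__name__="up",job="node"} 1 1650880000000
// i.e. labels, then value, then a millisecond timestamp (adjust the parsing
// if your promtool prints a different layout).
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	lastTS := map[string]int64{} // last timestamp seen per series
	flagged := map[string]bool{} // series already reported

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1024*1024), 16*1024*1024) // dump lines can be long

	for sc.Scan() {
		line := sc.Text()
		// The label set ends at the last closing brace; the value and the
		// timestamp never contain one, so LastIndex is safe here.
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		series := line[:end+1]
		fields := strings.Fields(line[end+1:])
		if len(fields) != 2 {
			continue
		}
		ts, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue
		}
		if prev, ok := lastTS[series]; ok && ts < prev && !flagged[series] {
			fmt.Printf("out-of-order: %s (ts %d after %d)\n", series, ts, prev)
			flagged[series] = true
		}
		lastTS[series] = ts
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
}
```

Something like `promtool tsdb dump /path/to/dir-containing-the-block | go run ooo_scan.go` should then print the affected series, assuming the dump format above.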

alanprot (Member) commented May 4, 2022

We sometimes see out-of-order samples as well, and it's not easy to root-cause the problem.

We are currently testing prometheus/prometheus#10624 to see if it will fix the issue in our case.

stale bot commented Aug 12, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Aug 12, 2022
stale bot closed this as completed on Oct 22, 2022