Investigating the reason out of order samples are appearing in LTS blocks #4724


Closed
shybbko opened this issue Apr 25, 2022 · 2 comments

shybbko commented Apr 25, 2022

I'm running two config-identical GKE Cortex clusters (Cortex 1.11, GKE 1.21). They aggregate metrics from several sources; you could say we're talking a nonprod and a prod cluster. Each cluster holds several TB of metrics.
In the nonprod cluster I have 0 blocks with out-of-order series.
In the prod cluster, over 20.

I only noticed this when those blocks were skipped from compaction because they contain OoO samples.

So far the out-of-order blocks are "grouped" around seven timestamps between January and March. The clusters have been online for much longer, but between December and January I was migrating from chunks to blocks (not sure whether that's relevant).
The faulty blocks either cover an "even" range (e.g. 4 to 6 pm) or not (e.g. 4:01:13 to 6 pm).
So far I don't recall any particular events taking place around those timestamps.
Each faulty block contains between 1 and 13 out-of-order series.
I also don't see any obvious correlation between the OoO samples (job, source node, etc.); they appear random.

I wanted to investigate the reasons for:

  • the blocks with OoO series appearing at all
  • the blocks appearing in the "prod" cluster only

So far I have come up with three possible explanations, but I can neither confirm nor deny any of them being the cause:

Any ideas, hints, suggestions? Ideally I'd like to ensure that not a single new block containing out-of-order series appears in the future.
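
Aside: for listing the offending series in a suspect block, here is a rough sketch (not Cortex code). It assumes `promtool tsdb dump` prints one sample per line as `{labels} value timestamp_ms`, which may differ between promtool versions, and the file name `ooo_scan.go` is just an example. The program reads the dump on stdin and reports every series whose timestamps go backwards.

```go
// ooo_scan.go: scan `promtool tsdb dump` output for series whose timestamps
// go backwards. Assumes each line looks like:
//   {__name__="up",job="node"} 1 1650880000000
// i.e. labels, then value, then a millisecond timestamp (adjust the parsing
// if your promtool prints a different layout).
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	lastTS := map[string]int64{} // last timestamp seen per series
	flagged := map[string]bool{} // series already reported

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1024*1024), 16*1024*1024) // dump lines can be long

	for sc.Scan() {
		line := sc.Text()
		// The label set ends at the last closing brace; the value and the
		// timestamp never contain one, so LastIndex is safe here.
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		series := line[:end+1]
		fields := strings.Fields(line[end+1:])
		if len(fields) != 2 {
			continue
		}
		ts, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			continue
		}
		if prev, ok := lastTS[series]; ok && ts < prev && !flagged[series] {
			fmt.Printf("out-of-order: %s (ts %d after %d)\n", series, ts, prev)
			flagged[series] = true
		}
		lastTS[series] = ts
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
}
```

Something like `promtool tsdb dump /path/to/dir-containing-the-block | go run ooo_scan.go` should then print the affected series, assuming the dump format above.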

alanprot (Member) commented May 4, 2022

We sometimes see out-of-order samples as well, and it's not easy to root-cause the problem.

We are currently testing prometheus/prometheus#10624 to see if it will fix the issue in our case.

stale bot commented Aug 12, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Aug 12, 2022
stale bot closed this as completed on Oct 22, 2022