Current cortex_compactor_blocks_marked_for_no_compaction_total value is lost upon redeployment #4727

Closed · shybbko opened this issue May 4, 2022 · 24 comments · Fixed by #4729

shybbko commented May 4, 2022

Describe the bug
Cortex Compactor, upon pod redeployment, loses the current value of the cortex_compactor_blocks_marked_for_no_compaction_total metric.

This may or may not be related to the fact that I'm currently running Cortex 1.11.0 deployed via Cortex Helm Chart 1.4.0 with all the bells and whistles (caches, key stores, etc.), but without the Cortex Compactor pod. The Compactor is deployed separately and with the minimum configuration possible. It's running the cortex:master-bb6b026 version in order to incorporate 4d751f2, which introduced a fix to the compaction process that was blocking compaction in my environment.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy Cortex
  2. Start compactions
  3. Wait until some blocks are marked for no compaction and the metric cortex_compactor_blocks_marked_for_no_compaction_total starts showing a value > 0
  4. Redeploy the whole of Cortex, or the Compactor pods only
  5. The metric now shows 0 (until a new no-compaction block is encountered)

Expected behavior
The current cortex_compactor_blocks_marked_for_no_compaction_total value is not lost upon Cortex redeployment

Environment:
GKE 1.21
Cortex 1.11.0 deployed via Cortex Helm Chart 1.4.0 (without compactor pod)
Cortex Compactor cortex:master-bb6b026 deployed separately

Storage Engine
Blocks

Additional Context
n/a

shybbko changed the title from "Current cortex_compactor_blocks_marked_for_no_compaction_total value is not lost upon Cortex redeployment" to "Current cortex_compactor_blocks_marked_for_no_compaction_total value is lost upon redeployment" on May 4, 2022
alanprot (Member) commented May 4, 2022

Expected behavior
The current cortex_compactor_blocks_marked_for_no_compaction_total value is not lost upon Cortex redeployment

Isn't this a brand-new metric introduced in 4d751f2?

shybbko (Author) commented May 4, 2022

Isn't this a brand-new metric introduced in 4d751f2?

That's correct. I'm running a compactor version that incorporates this commit.

alanprot (Member) commented May 4, 2022

So there is no existing behaviour that was changed in that commit, right?

This metric is a counter, and… starting again at 0 is expected, I guess.

You may want to try something like sum(rate(cortex_compactor_blocks_marked_for_no_compaction_total[5m])) or sum(delta(cortex_compactor_blocks_marked_for_no_compaction_total[5m]))?

shybbko (Author) commented May 4, 2022

So there is no existing behaviour that was changed in that commit, right?

The way I think of it: the metric was added, but there is a chance that the edge case of "what happens when we create a new set of pods on top of already existing storage that had some blocks marked for no compaction" was not considered. But I might be looking at the design from the wrong perspective?

This metric is a counter, and… starting again at 0 is expected, I guess.

As I said, I might be in the wrong here, but the way I see it, the metric should show the total count of all blocks that are in storage and marked for no compaction. Instead it seems (whether intentionally or not) to show the count of no-compaction blocks encountered by the current set of pods. Considering it's in the nature of K8s to replace pods, I think the latter is not very useful? By comparison, upon redeployment Cortex properly shows the current block count (instead of counting only newly created blocks).

You may want to try something like sum(rate(cortex_compactor_blocks_marked_for_no_compaction_total[5m])) or sum(delta(cortex_compactor_blocks_marked_for_no_compaction_total[5m]))?

Well, that doesn't work. I'm sharing a screenshot below; the first query is the one I use in my dashboards:

[screenshot: graph of the suggested queries]

shybbko (Author) commented May 4, 2022

I could run something like increase(cortex_compactor_blocks_marked_for_no_compaction_total[10y]) to ignore the drops, but that doesn't seem like good query design :D

alvinlin123 (Contributor) commented May 4, 2022

Hi @shybbko

I can't really see the graph for sum(rate(cortex_compactor_blocks_marked_for_no_compaction_total[5m])) in the screenshot you provided.

But would you mind changing 5m to 1m in the query? I am assuming your scrape interval is 1m here :)

Also, set the step of the query_range to 1m.

alanprot (Member) commented May 4, 2022

It seems we want a gauge, not a counter, here, if I'm understanding this correctly, right? A metric that shows the state of my storage right now (in other words, how many blocks are marked to be skipped across my entire storage). The problem is that this metric is created by Thanos:

https://github.com/thanos-io/thanos/blob/7cf9309695bba23dabdf39ecc595403965625db5/pkg/block/block.go#L397

What is the goal of that metric? Are you able to recover those blocks, so that the metric can always be 0?
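
For context, here is a minimal sketch (using prometheus/client_golang, which both Cortex and Thanos build on) of why such a counter behaves this way: the value lives only in process memory, so a redeployed pod starts counting again from zero. The metric name is the real one; everything else is illustrative.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	// Illustrative stand-in for the counter that Thanos registers in
	// pkg/block/block.go: the value lives only in process memory, so a
	// redeployed pod starts counting again from zero.
	marked := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "cortex_compactor_blocks_marked_for_no_compaction_total",
		Help: "Total number of blocks marked for no compaction.",
	})

	marked.Inc() // incremented only when a block is newly marked

	fmt.Println(testutil.ToFloat64(marked)) // prints 1; after a restart it would be 0 again
}
```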

shybbko (Author) commented May 4, 2022

Hi @alvinlin123!

I can't really see the graph for sum(rate(cortex_compactor_blocks_marked_for_no_compaction_total[5m])) in the screenshot you provided.

With a range of 30 days and a max Y-axis value of 25, it's almost invisible (its peak is around 0.008), but it's there. It's just a matter of screenshot resolution; apologies.

But would you mind changing 5m to 1m in the query?
Also, set the step of the query_range to 1m.

There you go (the range was cut to the relevant period, so it's different from the previous screenshot):

[screenshot: graph with the 1m range and step]

I am assuming your scrape interval is 1m here :)

Quite honestly, I don't know which flag in Cortex configures the self-scrape interval.

shybbko (Author) commented May 4, 2022

It seems we want a gauge, not a counter, here, if I'm understanding this correctly, right? A metric that shows the state of my storage right now (in other words, how many blocks are marked to be skipped across my entire storage). The problem is that this metric is created by Thanos.

What is the goal of that metric? Are you able to recover those blocks, so that the metric can always be 0?

I can work with the current design and something like increase(cortex_compactor_blocks_marked_for_no_compaction_total[10y]), although I don't think it's right. Anyway, I think it's up to you to decide whether this works as designed (i.e. it should show the current "effect" of the compactor's efforts) or should be redesigned to show the total count of no-compaction blocks in storage.

What I personally need is the latter, as I am interested in knowing how many of my blocks are (for various reasons) imperfect and marked for no compaction; based on that I can later investigate them. So far I have been unable to make the no-compaction blocks I've got compaction-ready, as they contain out-of-order samples. But that's a different story.

alvinlin123 (Contributor) commented

Alright, I think I misunderstood the issue from the beginning.

cortex_compactor_blocks_marked_for_no_compaction_total is a metric that is supposed to carry information about how many new blocks were marked for no compaction, and the reason they were marked. I think what you want is a metric that tells you how many blocks are currently marked for no compaction, regardless of whether they were newly or previously marked.

So, I think we will need to create a new metric to keep track of how many blocks are currently marked for no compaction.
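
A minimal sketch of what such a gauge could look like (prometheus/client_golang; the metric name here is illustrative only, and the user label mirrors the output shown later in the thread):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	// Hypothetical gauge for "blocks currently marked for no compaction",
	// keyed by tenant. Being a gauge, it is overwritten with a fresh count
	// on every storage scan instead of being incremented, so it survives
	// pod restarts with a meaningful value.
	markedOnStorage := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "blocks_currently_marked_for_no_compaction", // illustrative name only
		Help: "Number of blocks currently marked for no compaction in storage.",
	}, []string{"user"})

	// Result of a (hypothetical) scan of tenant "fake".
	markedOnStorage.WithLabelValues("fake").Set(22)

	fmt.Println(testutil.ToFloat64(markedOnStorage.WithLabelValues("fake"))) // 22
}
```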

alanprot (Member) commented May 5, 2022

Hi @shybbko,

Can you test whether this PR fixes your issue? #4729

You will need to set block_deletion_marks_migration_enabled if you want to backfill the blocks that already have the marker.

The metric name should be cortex_compactor_blocks_marked_for_no_compaction_on_storage_total, and it is a gauge now.

shybbko (Author) commented May 5, 2022

Hi @shybbko,

Can you test whether this PR fixes your issue? #4729

What does it take to test it? The bare minimum? Compactor only, judging by the code? What I mean is that I could prepare a simple test env and then build an image based on this branch in order to test a single pod. However, I am wondering whether it would require more resources.

alanprot (Member) commented May 5, 2022

This should not take any more resources… it just counts how many no-compaction markers are found in the same loop that already exists to build the bucket index.
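
A minimal sketch of that idea, assuming a simplified in-memory listing in place of the real object-store walk; the marker filename matches the no-compact-mark.json file Thanos writes, and everything else is illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// NoCompactMarkFilename is the per-block marker file written by Thanos.
const NoCompactMarkFilename = "no-compact-mark.json"

// countNoCompactMarks counts marker files in a listing of object paths.
// It is a simplified in-memory stand-in for the bucket walk that the
// bucket-index updater already performs.
func countNoCompactMarks(objectPaths []string) int {
	count := 0
	for _, path := range objectPaths {
		if strings.HasSuffix(path, NoCompactMarkFilename) {
			count++
		}
	}
	return count
}

func main() {
	// Hypothetical objects seen while building the bucket index:
	// two blocks, one of which carries a no-compaction marker.
	paths := []string{
		"01ABC/meta.json",
		"01ABC/no-compact-mark.json",
		"01DEF/meta.json",
	}
	fmt.Println("blocks marked for no compaction:", countNoCompactMarks(paths))
}
```

Because the count is recomputed from storage on every pass, it is naturally robust to pod restarts.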

shybbko (Author) commented May 5, 2022

So deploying the compactor pod only is fine here, correct?

alanprot (Member) commented May 5, 2022

Yeah… it should work.

shybbko (Author) commented May 6, 2022

Seems to work fine. One remark though: with block_deletion_marks_migration_enabled set to either true or false, I'm getting the same (correct) count of no-compaction blocks in storage:

```
curl -s 127.0.0.1:9001/metrics | grep cortex_compactor_blocks_marked | grep -v "#"
cortex_compactor_blocks_marked_for_deletion_total{reason="compaction"} 0
cortex_compactor_blocks_marked_for_deletion_total{reason="retention"} 0
cortex_compactor_blocks_marked_for_no_compaction_on_storage_total{user="fake"} 22
cortex_compactor_blocks_marked_for_no_compaction_total 0
```
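
(With the gauge in place, the storage-wide total can presumably be read directly, e.g. sum(cortex_compactor_blocks_marked_for_no_compaction_on_storage_total), rather than via the increase(cortex_compactor_blocks_marked_for_no_compaction_total[10y]) workaround discussed above.)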

alanprot (Member) commented May 6, 2022

Nice… we only need that the first time, to migrate the marks to the global folder.

alanprot (Member) commented May 6, 2022

It would be nice to have the reason as well, I guess…

shybbko (Author) commented May 6, 2022

It would be nice to have the reason as well, I guess…

Not sure if I get you...?

alanprot (Member) commented May 6, 2022

The reason label:

reason="compaction"

For that we would need to read the markers… we can do it afterwards if needed, I guess.

shybbko (Author) commented May 6, 2022

Oh, that. For my needs that's not that relevant, but I do agree it would be a fine addition.

alanprot (Member) commented May 6, 2022

I think now the question is: should we keep both metrics? If so, what should the name of this new metric be?

@alvinlin123, thoughts?

PS: this new metric is only available if the bucket index is enabled… I think that's fine.

alanprot (Member) commented May 6, 2022

OK… I opened the PR.

I changed the metric name to be aligned with the other one we have for the deletion marks:

cortex_bucket_blocks_marked_for_deletion_count
cortex_bucket_blocks_marked_for_no_compaction_count

shybbko (Author) commented May 6, 2022

Great! I appreciate the effort!
