
Make store-gateway mandatory when running the blocks storage #2822


Merged
Changes shown are from 12 of the 22 commits.

Commits:
- `465685a` Removed blocks bucket store support from querier (pracucci, Jul 1, 2020)
- `f32b10b` Updated documentation (pracucci, Jul 1, 2020)
- `04ae480` Added PR number to CHANGELOG entry (pracucci, Jul 1, 2020)
- `6a6dc19` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `a5abb21` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `ad5dcda` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `ea2da23` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `ff6d094` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `7c9ef1d` Merge branch 'master' into remove-bucket-store-support-from-querier (pracucci, Jul 2, 2020)
- `d593390` Updated doc (pracucci, Jul 2, 2020)
- `ff5bd87` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `2c89322` Updated doc (pracucci, Jul 2, 2020)
- `2cb62a2` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `26839f8` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `057d581` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `e6f9baa` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `9cf141a` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `6f642e8` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `48bbb2a` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `ff65879` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `b374e39` Update docs/operations/blocks-storage.template (pracucci, Jul 2, 2020)
- `f39c51c` Updated doc (pracucci, Jul 2, 2020)
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@
* `cortex_bucket_stores_gate_duration_seconds`
* [CHANGE] Metric `cortex_ingester_flush_reasons` has been renamed to `cortex_ingester_series_flushed_total`, and is now incremented during flush, not when series is enqueued for flushing. #2802
* [CHANGE] Experimental Delete Series: Metric `cortex_purger_oldest_pending_delete_request_age_seconds` would track age of delete requests since they are over their cancellation period instead of their creation time. #2806
* [CHANGE] Experimental TSDB: the store-gateway service is required in a Cortex cluster running with the experimental blocks storage. Removed the `-experimental.tsdb.store-gateway-enabled` CLI flag and `store_gateway_enabled` YAML config option. The store-gateway is now always enabled when the storage engine is `tsdb`. #2822
* [CHANGE] Ingester: Chunks flushed via /flush stay in memory until retention period is reached. This affects `cortex_ingester_memory_chunks` metric. #2778
* [FEATURE] Introduced `ruler.for-outage-tolerance`, Max time to tolerate outage for restoring "for" state of alert. #2783
* [FEATURE] Introduced `ruler.for-grace-period`, Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured "for" time greater than grace period. #2783
2 changes: 0 additions & 2 deletions development/tsdb-blocks-storage-s3/config/cortex.yaml
@@ -33,8 +33,6 @@ querier:
store_gateway_addresses: store-gateway-1:9008,store-gateway-2:9009

tsdb:
store_gateway_enabled: true

dir: /tmp/cortex-tsdb-ingester
ship_interval: 1m
block_ranges_period: [ 2h ]
9 changes: 2 additions & 7 deletions docs/configuration/config-file-reference.md
@@ -2773,8 +2773,8 @@ The `tsdb_config` configures the experimental blocks storage.
[block_ranges_period: <list of duration> | default = 2h0m0s]

# TSDB blocks retention in the ingester before a block is removed. This should
# be larger than the block_ranges_period and large enough to give queriers
# enough time to discover newly uploaded blocks.
# be larger than the block_ranges_period and large enough to give store-gateways
# and queriers enough time to discover newly uploaded blocks.
# CLI flag: -experimental.tsdb.retention-period
[retention_period: <duration> | default = 6h]

@@ -3063,11 +3063,6 @@ bucket_store:
# CLI flag: -experimental.tsdb.wal-compression-enabled
[wal_compression_enabled: <boolean> | default = false]

# True if the Cortex cluster is running the store-gateway service and the
# querier should query the bucket store via the store-gateway.
# CLI flag: -experimental.tsdb.store-gateway-enabled
[store_gateway_enabled: <boolean> | default = false]

# If true, and transfer of blocks on shutdown fails or is disabled, incomplete
# blocks are flushed to storage instead. If false, incomplete blocks will be
# reused after restart, and uploaded when finished.
66 changes: 44 additions & 22 deletions docs/operations/blocks-storage.md

Large diffs are not rendered by default.

57 changes: 42 additions & 15 deletions docs/operations/blocks-storage.template
@@ -22,32 +22,59 @@

When the blocks storage is used, each **ingester** creates a per-tenant TSDB and ships the TSDB Blocks - which by default are cut every 2 hours - to the long-term storage.

**Queriers** periodically iterate over the storage bucket to discover recently uploaded Blocks and - for each Block - download a subset of the block index - called "index header" - which is kept in memory and used to provide fast lookups.
The **store-gateways** periodically iterate over the storage bucket to discover recently uploaded Blocks, and for each block they download a subset of the block index (index-header) which is kept in memory and used to provide fast lookups at query time.

**Queriers** also periodically iterate over the storage bucket to discover recently uploaded Blocks but, unlike store-gateways, queriers do **not** download any content from Blocks except a small `meta.json` file containing the Block's metadata (including the minimum and maximum timestamp of the samples within the Block). Queriers use the metadata to compute the list of Blocks that need to be queried at query time and fetch the matching series from the store-gateway instances holding the required Blocks.

### The write path

**Ingesters** receive incoming samples from the distributors. Each push request belongs to a tenant, and the ingester appends the received samples to the specific per-tenant TSDB. The received samples are both kept in-memory and written to a write-ahead log (WAL) stored on the local disk, used to recover the in-memory series if the ingester abruptly terminates. The per-tenant TSDB is lazily created in each ingester as soon as the first push request is received for that tenant.

The in-memory samples are periodically flushed to disk - and the WAL truncated - when a new TSDB Block is cut, which by default occurs every 2 hours. Each new Block cut is then uploaded to the long-term storage and kept in the ingester for some more time, in order to give queriers enough time to discover the new Block from the storage and download its index header.
The in-memory samples are periodically flushed to disk - and the WAL truncated - when a new TSDB Block is created, which by default occurs every 2 hours. Each newly created Block is then uploaded to the long-term storage and kept in the ingester for some more time, in order to give queriers and store-gateways enough time to discover the new Block on the storage and download its index-header.

In order to effectively use the **WAL** and be able to recover the in-memory series if an ingester abruptly terminates, the WAL needs to be stored on a persistent local disk which can survive an ingester failure (e.g. an AWS EBS volume or a GCP persistent disk when running in the cloud). For example, if you're running the Cortex cluster in Kubernetes, you may use a StatefulSet with a persistent volume claim for the ingesters.
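
As a reference, the ingester TSDB options involved in this flow map to the following YAML sketch (values taken from the development config and config reference included in this PR; tune them for your deployment):

```yaml
tsdb:
  dir: /tmp/cortex-tsdb-ingester  # local directory where per-tenant TSDBs (and their WALs) are stored
  block_ranges_period: [ 2h ]     # new Blocks are cut every 2 hours by default
  ship_interval: 1m               # how often newly cut Blocks are shipped to the long-term storage
  retention_period: 6h            # how long shipped Blocks are kept on the ingester local disk
```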

### The read path

**Queriers** - at startup - iterate over the entire storage bucket to discover all tenants Blocks and - for each of them - download the index header. During this initial synchronization phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
At startup, **store-gateways** iterate over the entire storage bucket to discover the Blocks of all tenants and download the `meta.json` and index-header for each Block. During this initial bucket synchronization phase, the store-gateway `/ready` readiness probe endpoint will fail.

Queriers also periodically re-iterate over the storage bucket to discover newly uploaded Blocks (by the ingesters) and find out Blocks deleted in the meanwhile, as effect of an optional retention policy.
Similarly, **queriers** also iterate over the entire storage bucket at startup to discover the Blocks of all tenants and download the `meta.json` for each Block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.

The blocks chunks and the entire index is never fully downloaded by the queriers. In the read path, a querier lookups the series label names and values using the in-memory index header and then download the required segments of the index and chunks for the matching series directly from the long-term storage using byte-range requests.
Store-gateways and queriers also periodically re-iterate over the storage bucket to discover Blocks newly uploaded by the ingesters, and to detect Blocks that have been marked for deletion or hard deleted in the meantime as a result of compaction or an optional retention policy. The frequency at which this occurs is configured via `-experimental.tsdb.bucket-store.sync-interval`.

The index header is also stored to the local disk, in order to avoid to re-download it on subsequent restarts of a querier. For this reason, it's recommended - but not required - to run the querier with a persistent local disk. For example, if you're running the Cortex cluster in Kubernetes, you may use a StatefulSet with a persistent volume claim for the queriers.
The Block chunks and the entire index are never fully downloaded by the store-gateway. The index-header is stored on the local disk, in order to avoid re-downloading it on subsequent restarts of a store-gateway. For this reason, it's recommended - but not required - to run the store-gateway with a persistent local disk. For example, if you're running the Cortex cluster in Kubernetes, you may use a StatefulSet with a persistent volume claim for the store-gateways.
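
A minimal sketch of the bucket store options involved in the read path. Only `-experimental.tsdb.bucket-store.sync-interval` is explicitly named above; the YAML layout and the `sync_dir` option are assumptions to verify against the config reference:

```yaml
tsdb:
  bucket_store:
    sync_dir: /data/tsdb-sync  # local disk where the store-gateway keeps the downloaded index-headers
    sync_interval: 5m          # how often store-gateways and queriers re-scan the bucket
```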

### Series sharding and replication
### Distributor series sharding and replication

The series sharding and replication doesn't change based on the storage engine, so the general overview provided by the "[Cortex architecture](../architecture.md)" documentation applies to the blocks storage as well.
The series sharding and replication done by the distributor doesn't change based on the storage engine, so the general overview provided by the "[Cortex architecture](../architecture.md)" documentation applies to the blocks storage as well.

It's important to note that - unlike the [chunks storage](../architecture.md#chunks-storage-default) - time series are effectively written N times to the long-term storage, where N is the replication factor (typically 3). This may lead to N times the storage utilization of the chunks storage, but it is mitigated by the [compactor](#compactor) service (see "vertical compaction").

### Store-gateway blocks sharding and replication

Blocks can be optionally sharded and replicated across a pool of store-gateway instances. This feature can be enabled via `-experimental.store-gateway.sharding-enabled=true` and requires the backend ring to be configured via `-experimental.store-gateway.sharding-ring.*` flags (or their respective YAML config options).

When blocks sharding is **enabled**, store-gateway instances build a ring and blocks are sharded accordingly. Queriers also need to run with the same store-gateway sharding config in order to correctly address blocks across the pool of store-gateways.

When blocks sharding is **disabled**, queriers need the `-experimental.querier.store-gateway-addresses` CLI flag (or its respective YAML config option) to be set to a comma-separated list of store-gateway addresses in [DNS Service Discovery format](../configuration/arguments.md#dns-service-discovery). Queriers will evenly balance the requests to query blocks across the resolved addresses.
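
A configuration sketch for the sharded setup. The flag names come from this section, while the YAML layout and the consul ring backend are assumptions to adapt to your deployment:

```yaml
store_gateway:
  sharding_enabled: true  # -experimental.store-gateway.sharding-enabled=true
  sharding_ring:          # -experimental.store-gateway.sharding-ring.* options
    kvstore:
      store: consul       # ring backend (e.g. consul, etcd or memberlist)
```

Queriers must run with the same sharding configuration. When sharding is disabled, point the queriers directly at the store-gateways instead (addresses taken from the development config shown earlier in this PR):

```yaml
querier:
  # -experimental.querier.store-gateway-addresses (DNS Service Discovery format is supported)
  store_gateway_addresses: store-gateway-1:9008,store-gateway-2:9009
```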

### Anatomy of a query execution

When a querier receives a query range request, the request contains a few key pieces of information:

- `query`: the PromQL query expression itself (e.g. `rate(node_cpu_seconds_total[1m])`)
- `start`: the start time
- `end`: the end time
- `step`: the query resolution (e.g. `30000` ms to have 1 resulting data point every 30s)
> Reviewer note: the `step` sent to the Prometheus API endpoint is in seconds by default, but can be specified in "duration" format as well.
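
As a reference, a range query sent to the querier's Prometheus-compatible API carries the parameters above in the URL. This is a sketch: the actual path prefix depends on how the API is exposed in your deployment, and `start`/`end` can also be expressed as Unix timestamps:

```
GET <api-prefix>/api/v1/query_range?query=rate(node_cpu_seconds_total[1m])&start=2020-07-02T10:00:00Z&end=2020-07-02T12:00:00Z&step=30s
```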


Given a query, the querier analyzes the `start` and `end` time range to compute a list of all known Blocks containing at least 1 sample within this time range. Given the list of Blocks, the querier then computes a list of store-gateway instances holding these Blocks and sends a request to each matching store-gateway instance asking to fetch all the samples for the series matching the `query` within the `start` and `end` time range.

The request sent to each store-gateway instance contains the list of Block IDs that are expected to be queried, and the response sent by the store-gateway back to the querier contains the list of Block IDs actually queried. The set of Blocks actually queried by a store-gateway may be a subset of the requested Blocks, as an effect of a Blocks re-sharding that occurred very recently (i.e. in the last few seconds). The querier runs a consistency check on the responses received from the store-gateways to ensure all expected Blocks have been queried; if not, the querier retries fetching the missing Blocks from different store-gateways (if `-experimental.store-gateway.replication-factor` is greater than `1`) and, if the consistency check still fails after all retries, the query execution fails as well (correctness is always guaranteed).

Alongside the requests sent to the store-gateways, if the query time range covers a period within `-querier.query-ingesters-within`, the querier also sends a request to the ingesters, in order to fetch samples which have not yet been persisted into a Block uploaded to the long-term storage.

Once all samples have been fetched from both store-gateways and ingesters, the querier runs the PromQL engine to execute the query and sends the result back to the client.
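
A sketch of the querier option named above, assuming the YAML key mirrors the CLI flag name; the value is an example only and should be consistent with how long ingesters keep blocks on their local disk:

```yaml
querier:
  query_ingesters_within: 8h  # -querier.query-ingesters-within: only query ingesters for data newer than this
```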

### Compactor

The **compactor** is an optional - but highly recommended - service which compacts multiple Blocks of a given tenant into a single optimized larger Block. The compactor has two main benefits:
@@ -57,7 +84,7 @@

The **vertical compaction** compacts all the Blocks of a tenant uploaded by any ingester for the same Block range period (defaults to 2 hours) into a single Block, de-duplicating samples that were originally written to N Blocks as an effect of the replication.

The **horizontal compaction** triggers after the vertical compaction and compacts several Blocks belonging to adjacent small range periods (2 hours) into a single larger Block. Despite the total block chunks size doesn't change after this compaction, it may have a significative impact on the reduction of the index size and its index header kept in memory by queriers.
The **horizontal compaction** triggers after the vertical compaction and compacts several Blocks belonging to adjacent small range periods (2 hours) into a single larger Block. Although the total size of the Block chunks doesn't change after this compaction, it may significantly reduce the size of the index and of the index-header kept in memory by store-gateways.

The compactor is **stateless**.
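
A minimal compactor configuration sketch. None of these options are part of this PR, so the names below are assumptions based on the `-compactor.*` CLI flags and should be checked against the config reference:

```yaml
compactor:
  data_dir: /data/compactor        # local scratch space used while compacting blocks
  block_ranges: [ 2h, 12h, 24h ]   # horizontal compaction: 2h blocks are progressively merged into 12h and 24h blocks
  compaction_interval: 1h          # how often the compactor looks for new blocks to compact
```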

@@ -74,7 +101,7 @@

## Index cache

The querier and store-gateway support a cache to speed up postings and series lookups from TSDB blocks indexes. Two backends are supported:
The store-gateway supports a cache to speed up postings and series lookups from TSDB blocks indexes. Two backends are supported:

- `inmemory`
- `memcached`
@@ -84,15 +111,15 @@
The `inmemory` index cache is **enabled by default** and its max size can be configured through the flag `-experimental.tsdb.bucket-store.index-cache.inmemory.max-size-bytes` (or config file). The trade-off of using the in-memory index cache is:

- Pros: zero latency
- Cons: increased querier memory usage, not shared across multiple querier replicas
- Cons: increased store-gateway memory usage, not shared across multiple store-gateway replicas (when sharding is disabled or replication factor > 1)
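
A sizing sketch, assuming the YAML keys mirror the CLI flag named above:

```yaml
tsdb:
  bucket_store:
    index_cache:
      backend: inmemory
      inmemory:
        max_size_bytes: 1073741824  # 1GiB; -experimental.tsdb.bucket-store.index-cache.inmemory.max-size-bytes
```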

### Memcached index cache

The `memcached` index cache allows to use [Memcached](https://memcached.org/) as cache backend. This cache backend is configured using `-experimental.tsdb.bucket-store.index-cache.backend=memcached` and requires the Memcached server(s) addresses via `-experimental.tsdb.bucket-store.index-cache.memcached.addresses` (or config file). The addresses are resolved using the [DNS service provider](../configuration/arguments.md#dns-service-discovery).

The trade-off of using the Memcached index cache is:

- Pros: can scale beyond a single node memory (Memcached cluster), shared across multiple querier instances
- Pros: can scale beyond a single node's memory (Memcached cluster), shared across multiple store-gateway instances
- Cons: higher latency in the cache round trip compared to the in-memory one

The Memcached client uses a jump hash algorithm to shard cached entries across a cluster of Memcached servers. For this reason, you should make sure Memcached servers are **not** behind any kind of load balancer, and that their addresses are configured so that servers are added to or removed from the end of the list whenever a scale up/down occurs.
@@ -105,15 +132,15 @@

## Chunks cache

Store-gateway and querier also support cache for storing chunks fetched from storage. Chunks contain actual samples, and can be reused if user query hits the same series for the same time range.
The store-gateway also supports a cache for storing chunks fetched from the storage. Chunks contain the actual samples, and can be reused if a user query hits the same series for the same time range.

To enable the chunks cache, please set `-experimental.tsdb.bucket-store.chunks-cache.backend`. Chunks can currently only be stored in Memcached. The Memcached client can be configured via flags with the `-experimental.tsdb.bucket-store.chunks-cache.memcached` prefix.

There are additional low-level options for configuring the chunks cache. Please refer to the other flags with the `-experimental.tsdb.bucket-store.chunks-cache` prefix.

## Metadata cache

Store-gateway and querier can use memcached for storing metadata: list of users, list of blocks per user, meta.json files and deletion mark files. Using the cache can reduce number of API calls to object storage significantly.
The store-gateway and querier can use Memcached for caching bucket metadata: the list of users, the list of blocks per user, meta.json files and deletion mark files. Using the cache can significantly reduce the number of API calls to the object storage.

To enable the metadata cache, please set `-experimental.tsdb.bucket-store.metadata-cache.backend`. Only the `memcached` backend is currently supported. The Memcached client has additional configuration available via flags with the `-experimental.tsdb.bucket-store.metadata-cache.memcached` prefix.
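
The index, chunks and metadata caches described above share the same Memcached client options. A combined sketch, assuming the YAML keys mirror the CLI flag prefixes and using the `dnssrv+` service-discovery notation against a Kubernetes headless service (both assumptions to verify against your setup):

```yaml
tsdb:
  bucket_store:
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.memcached-index.cortex.svc.cluster.local
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.memcached-chunks.cortex.svc.cluster.local
    metadata_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.memcached-metadata.cortex.svc.cluster.local
```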

@@ -163,4 +190,4 @@

### Migrating from the chunks to the blocks storage

Currently, no smooth migration path is provided to migrate from chunks to blocks storage. For this reason, the blocks storage can only be enabled in new Cortex clusters.
We're currently working on a smooth migration path from the chunks to the blocks storage. More information can be found in this [proposal](../proposals/ingesters-migration.md).