cortexproject
diff --git a/ADOPTERS.md b/ADOPTERS.md
+1
diff --git a/CHANGELOG.md b/CHANGELOG.md
+8-1
diff --git a/docs/blocks-storage/_index.md b/docs/blocks-storage/_index.md
+1-1
diff --git a/docs/blocks-storage/bucket-index.md b/docs/blocks-storage/bucket-index.md
+57
diff --git a/docs/blocks-storage/compactor.md b/docs/blocks-storage/compactor.md
+1-1
diff --git a/docs/blocks-storage/compactor.template b/docs/blocks-storage/compactor.template
+1-1
diff --git a/docs/blocks-storage/querier.md b/docs/blocks-storage/querier.md
+64-4
@@ -2,6 +2,7 @@
 
 This is the list of organisations that are using Cortex in **production environments** to power their metrics and monitoring systems. Please send PRs to add or remove organisations.
 
+* [Amazon Web Services (AWS)](https://aws.amazon.com/prometheus)
 * [Aspen Mesh](https://aspenmesh.io/)
 * [Buoyant](https://buoyant.io/)
 * [DigitalOcean](https://www.digitalocean.com/)
 
@@ -6,17 +6,24 @@
 * [CHANGE] Blocks storage: compactor is now required when running a Cortex cluster with the blocks storage, because it also keeps the bucket index updated. #3583
 * [CHANGE] Blocks storage: block deletion marks are now stored in a per-tenant global markers/ location too, other than within the block location. The compactor, at startup, will copy deletion marks from the block location to the global location. This migration is required only once, so you can safely disable it via `-compactor.block-deletion-marks-migration-enabled=false` once new compactor has successfully started once in your cluster. #3583
 * [ENHANCEMENT] Blocks storage: introduced a per-tenant bucket index, periodically updated by the compactor, used to avoid full bucket scanning done by queriers and store-gateways. The bucket index is updated by the compactor during blocks cleanup, on every `-compactor.cleanup-interval`. #3553 #3555 #3561 #3583
+* [ENHANCEMENT] Blocks storage: introduced an option `-blocks-storage.bucket-store.bucket-index.enabled` to enable the usage of the bucket index in the querier. When enabled, the querier will use the bucket index to find a tenant's blocks instead of running the periodic bucket scan. The following new metrics have been added: #3614
+  * `cortex_bucket_index_loads_total`
+  * `cortex_bucket_index_load_failures_total`
+  * `cortex_bucket_index_load_duration_seconds`
+  * `cortex_bucket_index_loaded`
 * [ENHANCEMENT] Compactor: exported the following metrics. #3583
   * `cortex_bucket_blocks_count`: Total number of blocks per tenant in the bucket. Includes blocks marked for deletion.
   * `cortex_bucket_blocks_marked_for_deletion_count`: Total number of blocks per tenant marked for deletion in the bucket.
   * `cortex_bucket_index_last_successful_update_timestamp_seconds`: Timestamp of the last successful update of a tenant's bucket index.
 * [ENHANCEMENT] Ruler: Add `cortex_prometheus_last_evaluation_samples` to expose the number of samples generated by a rule group per tenant. #3582
 * [ENHANCEMENT] Memberlist: add status page (/memberlist) with available details about memberlist-based KV store and memberlist cluster. It's also possible to view KV values in Go struct or JSON format, or download for inspection. #3575
-* [ENHANCEMENT] Memberlist: client can now keep a size-bounded buffer with sent and received messages and display them in the admin UI (/memberlist) for troubleshooting. #3581
+* [ENHANCEMENT] Memberlist: client can now keep a size-bounded buffer with sent and received messages and display them in the admin UI (/memberlist) for troubleshooting. #3581 #3602
+* [BUGFIX] Allow `-querier.max-query-lookback` to use the `y|w|d` suffixes, like the deprecated `-store.max-look-back-period`. #3598
 * [BUGFIX] Query-Frontend: `cortex_query_seconds_total` now return seconds not nanoseconds. #3589
 * [ENHANCEMENT] Add api to list all tenant alertmanager configs and ruler rules. #3259
    - `GET /multitenant_alertmanager/configs`
    - `GET /ruler/rules`
+* [BUGFIX] Memberlist: Entry in the ring should now not appear again after using "Forget" feature (unless it's still heartbeating). #3603
 
 ## 1.6.0-rc.0 in progress
 
 
@@ -29,7 +29,7 @@ When running the Cortex blocks storage, the Cortex architecture doesn't signific
 
 The **[store-gateway](./store-gateway.md)** is responsible to query blocks and is used by the [querier](./querier.md) at query time. The store-gateway is required when running the blocks storage.
 
-The **[compactor](./compactor.md)** is responsible to merge and deduplicate smaller blocks into larger ones, in order to reduce the number of blocks stored in the long-term storage for a given tenant and query them more efficiently. It also keeps the bucket index updated and, for this reason, it's a required component.
+The **[compactor](./compactor.md)** is responsible to merge and deduplicate smaller blocks into larger ones, in order to reduce the number of blocks stored in the long-term storage for a given tenant and query them more efficiently. It also keeps the [bucket index](./bucket-index.md) updated and, for this reason, it's a required component.
 
 Finally, the [**table-manager**](../chunks-storage/table-manager.md) and the [**schema config**](../chunks-storage/schema-config.md) are **not used** by the blocks storage.
 
 
@@ -0,0 +1,57 @@
+---
+title: "Bucket Index"
+linkTitle: "Bucket Index"
+weight: 5
+slug: bucket-index
+---
+
+The bucket index is a **per-tenant file containing the list of blocks and block deletion marks** in the storage. The bucket index itself is stored in the backend object storage, is periodically updated by the compactor and used by queriers to discover blocks in the storage.
+
+The bucket index usage is **optional** and can be enabled via `-blocks-storage.bucket-store.bucket-index.enabled=true` (or its respective YAML config option).
+
+## Benefits
+
+The [querier](./querier.md) needs to have an almost up-to-date view over the entire storage bucket, in order to find the right blocks to look up at query time. Because of this, the querier needs to periodically scan the bucket to look for new blocks uploaded by the ingester or compactor, and blocks deleted (or marked for deletion) by the compactor.
+
+When the bucket index is enabled, the querier periodically looks up the per-tenant bucket index instead of scanning the bucket via "list objects" operations. This brings a few benefits:
+
+1. Reduced number of API calls to the object storage by querier
+2. No "list objects" storage API calls done by querier
+3. The [querier](./querier.md) is up and running immediately after the startup (no need to run an initial bucket scan)
+
+## Structure of the index
+
+The `bucket-index.json.gz` contains:
+
+- **`blocks`**<br />
+  List of complete blocks of a tenant, including blocks marked for deletion (partial blocks are excluded from the index).
+- **`block_deletion_marks`**<br />
+  List of block deletion marks.
+- **`updated_at`**<br />
+  Unix timestamp (seconds precision) of when the index was last updated (written to the storage).
+
+## How it gets updated
+
+The [compactor](./compactor.md) periodically scans the bucket and uploads an updated bucket index to the storage. The frequency at which the bucket index is updated can be configured via `-compactor.cleanup-interval`.
+
+Although using the bucket index is optional, the index itself is built and updated by the compactor even if `-blocks-storage.bucket-store.bucket-index.enabled` has **not** been enabled. This is intentional, so that once a Cortex cluster operator decides to enable the bucket index in a live cluster, the bucket index for every tenant already exists and the consistency of query results is guaranteed. The overhead introduced by keeping the bucket index updated is expected to be insignificant.
+
+## How it's used by the querier
+
+The [querier](./querier.md), at query time, checks whether the bucket index for the tenant has already been loaded in memory. If not, the querier downloads it from the storage and caches it in memory.
+
+_Given it's a small file, lazily downloading it doesn't significantly impact first-query performance, but it allows a querier to get up and running without pre-downloading every tenant's bucket index. Moreover, if the [metadata cache](./querier.md#metadata-cache) is enabled, the bucket index will be cached for a short time in a shared cache, reducing the latency and the number of API calls to the object storage when multiple queriers fetch the same tenant's bucket index within a short time._
+
+![Querier - Bucket index](/images/blocks-storage/bucket-index-querier-logic.png)
+<!-- Diagram source at https://docs.google.com/presentation/d/1bHp8_zcoWCYoNU2AhO2lSagQyuIrghkCncViSqn14cU/edit -->
+
+While the bucket index is in memory, a background process keeps it **updated at periodic intervals**, so that subsequent queries from the same tenant to the same querier instance use the cached (and periodically updated) bucket index. There are two config options involved:
+
+- `-blocks-storage.bucket-store.bucket-index.update-on-stale-interval`<br />
+  This option configures how frequently a cached bucket index should be refreshed.
+- `-blocks-storage.bucket-store.bucket-index.update-on-error-interval`<br />
+  If downloading a bucket index fails, the failure is cached for a short time in order to avoid hammering the backend storage. This option configures how frequently the querier should retry loading a bucket index that previously failed to load.
+
+If a bucket index is unused for a long time (configurable via `-blocks-storage.bucket-store.bucket-index.idle-timeout`), e.g. because that querier instance is not receiving any query from the tenant, the querier will unload it and stop keeping it updated at regular intervals. This is particularly useful for tenants which are resharded to different queriers when [shuffle sharding](../guides/shuffle-sharding.md) is enabled.
+
+Finally, the querier, at query time, checks how old the bucket index is (based on its `updated_at`) and fails the query if the index is older than `-blocks-storage.bucket-store.bucket-index.max-stale-period`. This circuit breaker ensures that queriers do not return partial query results due to a stale view over the long-term storage.
@@ -10,7 +10,7 @@ slug: compactor
 The **compactor** is an service which is responsible to:
 
 - Compact multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction), and increase query speed (querying fewer blocks is faster).
-- Keep the per-tenant bucket index updated. The bucket index is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.
+- Keep the per-tenant bucket index updated. The [bucket index](./bucket-index.md) is used by [queriers](./querier.md) to discover new blocks in the storage.
 
 The compactor is **stateless**.
 
 
@@ -10,7 +10,7 @@ slug: compactor
 The **compactor** is an service which is responsible to:
 
 - Compact multiple blocks of a given tenant into a single optimized larger block. This helps to reduce storage costs (deduplication, index size reduction), and increase query speed (querying fewer blocks is faster).
-- Keep the per-tenant bucket index updated. The bucket index is used by [queriers](./querier.md) and [store-gateways](./store-gateway.md) to discover new blocks in the storage.
+- Keep the per-tenant bucket index updated. The [bucket index](./bucket-index.md) is used by [queriers](./querier.md) to discover new blocks in the storage.
 
 The compactor is **stateless**.
 
 
@@ -13,12 +13,28 @@ The querier is **stateless**.
 
 ## How it works
 
-At startup **queriers** iterate over the entire storage bucket to discover all tenants blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
+The querier needs to have an almost up-to-date view over the entire storage bucket, in order to find the right blocks to look up at query time. The querier can keep the bucket view updated in two different ways:
+
+1. Periodically scanning the bucket (default)
+2. Periodically downloading the [bucket index](./bucket-index.md)
+
+### Bucket index disabled (default)
+
+At startup, **queriers** iterate over the entire storage bucket to discover all tenants blocks and download the `meta.json` for each block. During this initial bucket scanning phase, a querier is not ready to handle incoming queries yet and its `/ready` readiness probe endpoint will fail.
 
 While running, queriers periodically iterate over the storage bucket to discover new tenants and recently uploaded blocks. Queriers do **not** download any content from blocks except a small `meta.json` file containing the block's metadata (including the minimum and maximum timestamp of samples within the block).
 
 Queriers use the metadata to compute the list of blocks that need to be queried at query time and fetch matching series from the [store-gateway](./store-gateway.md) instances holding the required blocks.
 
+### Bucket index enabled
+
+When the [bucket index](./bucket-index.md) is enabled, queriers lazily download the bucket index upon the first query received for a given tenant, cache it in memory and periodically keep it updated. The bucket index contains the list of blocks and block deletion marks of a tenant, which is later used during query execution to find the set of blocks that need to be queried for the given query.
+
+Given that the bucket index removes the need to scan the bucket, it brings a few benefits:
+
+1. The querier is expected to be ready shortly after startup.
+2. Lower volume of API calls to object storage.
+
 ### Anatomy of a query request
 
 When a querier receives a query range request, it contains the following parameters:
@@ -60,6 +76,7 @@ Caching is optional, but **highly recommended** in a production environment. Ple
 - List of blocks per tenant
 - Block's `meta.json` content
 - Block's `deletion-mark.json` existence and content
+- Tenant's `bucket-index.json.gz` content
 
 Using the metadata cache can significantly reduce the number of API calls to object storage and protects from linearly scale the number of these API calls with the number of querier and store-gateway instances (because the bucket is periodically scanned and synched by each querier and store-gateway).
 
@@ -341,8 +358,8 @@ blocks_storage:
     # CLI flag: -blocks-storage.filesystem.dir
     [dir: <string> | default = ""]
 
-  # This configures how the store-gateway synchronizes blocks stored in the
-  # bucket.
+  # This configures how the querier and store-gateway discover and synchronize
+  # blocks stored in the bucket.
   bucket_store:
     # Directory to store synchronized TSDB index headers.
     # CLI flag: -blocks-storage.bucket-store.sync-dir
@@ -579,14 +596,30 @@ blocks_storage:
       # CLI flag: -blocks-storage.bucket-store.metadata-cache.metafile-content-ttl
       [metafile_content_ttl: <duration> | default = 24h]
 
-      # Maximum size of metafile content to cache in bytes.
+      # Maximum size of metafile content to cache in bytes. Caching will be
+      # skipped if the content exceeds this size. This is useful to avoid
+      # network round trip for large content if the configured caching backend
+      # has a hard limit on cached items size (in this case, you should set
+      # this limit to the same limit in the caching backend).
       # CLI flag: -blocks-storage.bucket-store.metadata-cache.metafile-max-size-bytes
       [metafile_max_size_bytes: <int> | default = 1048576]
 
       # How long to cache attributes of the block metafile.
       # CLI flag: -blocks-storage.bucket-store.metadata-cache.metafile-attributes-ttl
       [metafile_attributes_ttl: <duration> | default = 168h]
 
+      # How long to cache content of the bucket index.
+      # CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-content-ttl
+      [bucket_index_content_ttl: <duration> | default = 5m]
+
+      # Maximum size of bucket index content to cache in bytes. Caching will be
+      # skipped if the content exceeds this size. This is useful to avoid
+      # network round trip for large content if the configured caching backend
+      # has a hard limit on cached items size (in this case, you should set
+      # this limit to the same limit in the caching backend).
+      # CLI flag: -blocks-storage.bucket-store.metadata-cache.bucket-index-max-size-bytes
+      [bucket_index_max_size_bytes: <int> | default = 1048576]
+
     # Duration after which the blocks marked for deletion will be filtered out
     # while fetching blocks. The idea of ignore-deletion-marks-delay is to
     # ignore blocks that are marked for deletion with some delay. This ensures
@@ -596,6 +629,33 @@ blocks_storage:
     # CLI flag: -blocks-storage.bucket-store.ignore-deletion-marks-delay
     [ignore_deletion_mark_delay: <duration> | default = 6h]
 
+    bucket_index:
+      # True to enable querier to discover blocks in the storage via bucket
+      # index instead of bucket scanning.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.enabled
+      [enabled: <boolean> | default = false]
+
+      # How frequently a cached bucket index should be refreshed.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-stale-interval
+      [update_on_stale_interval: <duration> | default = 15m]
+
+      # How frequently a bucket index, which previously failed to load, should
+      # be tried to load again.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.update-on-error-interval
+      [update_on_error_interval: <duration> | default = 1m]
+
+      # How long an unused bucket index should be cached. Once this timeout
+      # expires, the unused bucket index is removed from the in-memory cache.
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.idle-timeout
+      [idle_timeout: <duration> | default = 1h]
+
+      # The maximum allowed age of a bucket index (last updated) before queries
+      # start failing because the bucket index is too old. The bucket index is
+      # periodically updated by the compactor, while this check is enforced in
+      # the querier (at query time).
+      # CLI flag: -blocks-storage.bucket-store.bucket-index.max-stale-period
+      [max_stale_period: <duration> | default = 1h]
+
   tsdb:
     # Local directory to store TSDBs in the ingesters.
     # CLI flag: -blocks-storage.tsdb.dir