From e08af0fce89efe83582c197dc6482ff4c3aeca00 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Wed, 9 Jun 2021 23:38:05 -0400 Subject: [PATCH 01/14] Add proposal document Signed-off-by: Gofman Signed-off-by: ilangofman --- .../block-storage-time-series-deletion.md | 271 ++++++++++++++++++ 1 file changed, 271 insertions(+) create mode 100644 docs/proposals/block-storage-time-series-deletion.md diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md new file mode 100644 index 00000000000..ab016780e8d --- /dev/null +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -0,0 +1,271 @@ +--- +title: "Time Series Deletion from Blocks Storage" +linkTitle: "Time Series Deletion from Blocks Storage" +weight: 1 +slug: block-storage-time-series-deletion +--- + +- Author: [Ilan Gofman](https://github.com/ilangofman) +- Date: June 2021 +- Status: Proposal + +## Problem + +Currently, Cortex only implements a time series deletion API for chunk storage. We present a design for implementing time series deletion with block storage. We would like to have the same API for deleting series as currently implemented in Prometheus and in Cortex with chunk storage. + + +This can be very important for users to have as confidential or accidental data might have been incorrectly pushed and needs to be removed. + +## Related Works + +As previously mentioned, the deletion feature is already implemented with chunk storage. The main functionality is implemented through the purger service. It accepts requests for deletion and processes them. At first, when a deletion request is made, a tombstone is created. This is used to filter out the data for queries. After some time, a deletion plan is executed where the data is permanently removed from chunk storage. + +Can find more info here: + +- [Cortex documentation for chunk store deletion](https://cortexmetrics.io/docs/guides/deleting-series/) +- [Chunk deletion proposal](https://docs.google.com/document/d/1PeKwP3aGo3xVrR-2qJdoFdzTJxT8FcAbLm2ew_6UQyQ/edit) + + + +## Background + +With a block-storage configuration, Cortex stores data that could be potentially deleted by a user in: + +- Object store (GCS, S3, etc..) for long term storage of blocks +- Ingestors for more recent data that should be eventually transferred to the object store +- Cache + - Index cache + - Metadata cache + - Chunks cache (stores the potentially to be deleted data) + - Query results cache (stores the potentially to be deleted data) + - Compactor during the compaction process + - Store-gateway + + +## Proposal + +The deletion will not happen right away. Initially, the data will be filtered out from queries using tombstones and will be deleted afterward. This will allow the user some time to cancel the delete request. + +### API Endpoints + +The existing purger service will be used to process the incoming requests for deletion. The API will follow the same structure as the chunk storage endpoints for deletion, which is also based on the Prometheus deletion API. + +This will enable the following endpoints for Cortex when using block storage: + +`POST /api/v1/admin/tsdb/delete_series` - Accepts [Prometheus style delete request](https://prometheus.io/docs/prometheus/latest/querying/api/#delete-series) for deleting series. + +Parameters: + +- `start=` + - Optional. If not provided, will be set to minimum possible time. +- `end= ` + - Optional. If not provided, will be set to maximum possible time (time when request was made). 
End time cannot be greater than the current UTC time. +- `match[]=` + - Cannot be empty, must contain at least one label matcher argument. + + +`POST /api/v1/admin/tsdb/cancel_delete_request` - To cancel a request if it has not been processed yet for permanent deletion. I.e. it is in the query filtering stage (see the deletion lifecycle below). + +Parameters: + +- `request_id` + +`GET /api/v1/admin/tsdb/delete_series` - Get all delete requests id’s and their current status. + +Prometheus also implements a [clean_tombstones](https://prometheus.io/docs/prometheus/latest/querying/api/#clean-tombstones) API which is not included in this proposal. The tombstones will be deleted automatically once the permanent deletion has taken place which is described in the section below. By default, this should take approximately 24 hours. + +### Deletion Lifecycle + +The deletion request lifecycle can follow the same states as currently implemented for chunk storage: + +1. StatusReceived - No actions are done on request yet, just doing query time filtering +2. StatusBuildingPlan - Request picked up for processing and building plans for it, still doing query time filtering +3. StatusDeleting - Plans built already, running delete operations and still doing query time filtering +4. StatusProcessed - All requested data deleted, not considering this for query time filtering + +(copied from the chunk storage series deletion proposal) + +With the current chunk store implementation, the amount of time for the request to move from StatusReceived to StatusBuildingPlan is dependent on a config option: `-purger.delete-request-cancel-period`. The purpose of this is to allow the user some time to cancel the deletion request if it was made by mistake. + + +#### Current entities in the Purger that can be reused + +The following entities already exist for the chunk store and can be adapted to work for block storage: + +- `DeleteStore` - Handles storing and fetching delete requests, delete plans, cache gen numbers +- `TombstonesLoader` - Helps loading delete requests and cache gen. Also keeps checking for updates. +- `CacheGenMiddleware` - Adds generation number as a prefix to cache keys before doing Store/Fetch operations. + +(copied from the chunk storage series deletion proposal) + + + +### Filtering data during queries while not yet deleted: + +This will be done during the `DeleteRequest = StatusReceived` and `DeleteRequest = StatusBuildingPlan` parts of the deletion lifecycle. + +Once a deletion request is received, a tombstone entry will be created. The object store such as S3, GCS, Azure storage, can be used to store all the deletion requests. See the section below for more detail on how the tombstones will be stored. Using the tombstones, the querier will be able to filter the to-be-deleted data initially. In addition, the existing cache will be invalidated using cache generation numbers, which are described in the later sections. + +From the purger service, the existing TombstonesLoader will periodically check for new tombstones in the object-store and load the new requests for deletion periodically using the modified DeleteStore. They will be loaded in memory periodically instead of retrieving all the tombstones when performing queries. Currently, with chunk storage, the TombstonesLoader is configured to check for updates to the DeleteStore every 5 minutes. + +Similar to the chunk storage deletion implementation, the initial filtering of the deleted data will be done inside the Querier. 
This will allow filtering the data read from both the store gateway and the ingestor. This functionality already exists for the chunk storage implementation. By implementing it in the querier, this would mean that the ruler can also utilize this to filter out the various metrics for the alert manager (read from the store gateway). + + +#### Storing Tombstones in Object Store + + +The Purger will store the tombstone entries in a separate folder called “series_deletion” in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as the current status and some meta-data such as the creation date of the request. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. + +The tombstone will be stored in a single file per request: + +- `//series_deletion/requests/.json` + + +The schema of the JSON file is: + + +```{ + "requestId": , + "startTime": , + "endTime": , + "creationTime": , + "matchers": [ + "", + .., + "" + ] + }, + "userID": , + "deleteRequestStatus": +} +``` + + +Pros: + +- Design is similar to the existing chunk storage deletion + - Lots of code can be reused inside the purger component. +- Allows deletion and un-delete to be done in a single operation. + +Cons: +- Negative impact on query performance when there are active tombstones. As in the chunk storage implementation, all the queries made will have to be compared to the matchers contained in the active tombstone files. The impact on performance should be the same as the deletion would have with chunk storage. + + +#### Invalidating Cache + +Using block store, the different caches available are: +- Index cache +- Metadata cache +- Chunks cache (stores the potentially to be deleted chunks of data) +- Query results cache (stores the potentially to be deleted data) + +The filtering using tombstones is only requested by the querier. By using the purger, the querier filters out the data received from the ingestors and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. + + +Firstly, the query results cache needs to be invalidated for each new delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. The cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to store a per tenant JSON file with the corresponding cache generation numbers. + +Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. 
However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. This is why it is also best to use cache generation numbers for this cache in order to invalidate it the same way. + +This file can be stored in: + +- `//series_deletion/cache_generation_numbers.json` + +An example of the schema for the JSON file: + +``` +{ + "userID": , + "resultsCache": { + "generationNum": + }, + "storeCache": { + "generationNum": + } +} +``` + +### Permanently deleting the data + +To delete the data permanently from the block storage, deletion marker files will be used. The proposed approach is to perform the deletions from the compactor. A new background service inside the compactor called _DeletedSeriesCleaner_ can be created and is responsible for executing deletion. The process of permanently deleting the data can be separated into 2 stages, preprocessing and processing. + +#### Pre-processing + +This will happen after a grace period has passed once the API request has been made. By default, this should be 24 hours. The Purger will outline all the blocks that may contain data to be deleted. For each separate block that the deletion may be applicable to, the Purger will begin the process by adding a series deletion marker inside the series-deletion-marker.json file. The JSON file will contain an array of deletions that need to be applied to the block, which allows the ability to handle the situation when there are multiple tombstones that could be applicable to a particular block. + +#### Processing + + +This will happen when the `DeleteRequest = StatusDeleting` in the deletion lifecycle. A background task can be created to process the permanent deletion of time series using the information inside the series-deletion-marker.json files. This can be done each hour. + +To delete the data from the blocks, the same logic as the [Bucket Rewrite Tool](https://thanos.io/tip/components/tools.md/#bucket-rewrite +) from Thanos can be leveraged. This tool does the following: `tools bucket rewrite rewrites chosen blocks in the bucket, while deleting or modifying series`. The tool itself is a CLI tool that we won’t be using, but instead we can utilize the logic inside it. For more information about the way this tool runs, please see the code [here](https://github.com/thanos-io/thanos/blob/d8b21e708bee6d19f46ca32b158b0509ca9b7fed/cmd/thanos/tools_bucket.go#L809). + +The compactor’s _DeletedSeriesCleaner_ will apply this logic on individual blocks and each time it is run, it creates a new block without the data that matched the deletion request. The original individual blocks containing the data that was requested to be deleted, need to be marked for deletion by the compactor. + +One important thing to note regarding this tool is that it should not be used at the same time as when another compactor is touching a block. If the tool is run at the same time as compaction on a particular block, it can cause overlap and the data marked for deletion can already be part of the compacted block. To mitigate such issues, these are some of the proposed solutions: + +Option 1: Only apply the deletion once the blocks are in the final state of compaction. + +Pros: +- Simpler implementation as everything is contained within the DeletedSeriesCleaner. + +Cons: +- Might have to wait for a longer period of time for the compaction to be finished. 
+ - This would mean the earliest time to be able to run the deletion would be once the last time from the block_ranges in the [compactor_config](https://cortexmetrics.io/docs/blocks-storage/compactor/#compactor-configuration) has passed. By default this value is 24 hours, so only once 24 hours have passed and the new compacted blocks have been created, then the rewrite can be safely run. + + + + +Option 2: For blocks that still need to be compacted further after the deletion request cancel period is over, the deletion logic can be applied before the blocks are compacted. This will generate a new block which can then be used instead for compaction with other blocks. + +Pros: +- The deletion can be applied earlier than the previous options. + - Only applies if the deletion request cancel period is less than the last time interval for compaction is. +Cons: +- Added coupling between the compaction and the DeletedSeriesCleaner. +- Might block compaction for a short time while doing the deletion. + +In the newly created block without the deleted time series data, the information about the deletion is added to the meta.json file. This will indicate which deletion requests have been filtered out of this new block. This is necessary because it will let the Purger service know that this block doesn’t need to be rewritten again. + +To determine when a deletion request is complete, the purger will iterate through all the applicable blocks that might have data to be deleted. If there are any blocks that don’t have the tombstone ID in the meta.json of the block indicating the deletion has been complete, then the purger will add the series deletion markers to those blocks (if it doesn’t already exist). If after iterating through all blocks, it doesn’t find any such blocks, then that means the compactor has finished executing. + +Once all the applicable blocks have been rewritten without the deleted data, the deletion request moves to `DeleteRequest = StatusProcessed` and the tombstone is deleted. + + + +##### Handling failed/unfinished delete jobs: + +Deletions will be completed and the tombstones will be deleted only when the Purger iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep creating the markers indicating which blocks are remaining for deletion. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. The series deletion markers will remain in the bucket until the new blocks are created without the deleted data. Meaning that the compactor will continue to process the blocks for deletion that are remaining according to the deletion markers. + + +#### Tenant Deletion API + +If a request is made to delete a tenant, then all the tombstones will be deleted for that user. For all the tombstones deleted, if there were any series deletion markers for the tombstones deleted, these will also need to be deleted prior to marking the tenant’s blocks for deletion. + + + +## Current Open Questions: + +- If the start and end time is very far apart, it might result in a lot of the data being re-written. Since we create a new block without the deleted data and mark the old one for deletion, there may be a period of time with lots of extra blocks and space used for large deletion queries. +- Need to outline more clearly how this will work with multiple deletion requests at a time. 
+ + + +## Alternatives Considered + + +For processing the actual deletions, an alternative approach is not to wait until the final compaction has been completed and filter out the data during compaction. If the data is marked to be deleted, then don’t include it the new bigger block during compaction. For the remaining blocks where the data wasn’t filtered during compaction, the deletion can be done the same as in the previous section. + +Pros: + +- The deletion can happen sooner. +- The rewrite tools creates additional blocks. By filtering the metrics during compaction, the intermediary re-written block will be avoided. + +Cons: + +- A more complicated implementation requiring add more logic to the compactor +- Slower compaction if it needs to filter all the data +- Need to manage which blocks should be deleted with the rewrite vs which blocks already had data filtered during compaction. +- Would need to run the rewrite logic during and outside of compaction because some blocks that might need to be deleted are already in the final compaction state. So that would mean the deletion functionality has to be implemented in multiple places. +- Won’t be leveraging the rewrites tools from Thanos for all the deletion, so potentially more work is duplicated + From 8027c4afc40b743d75b3d3caa800395ade1e7308 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Wed, 9 Jun 2021 23:53:59 -0400 Subject: [PATCH 02/14] Minor text modifications Signed-off-by: ilangofman --- .../proposals/block-storage-time-series-deletion.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index ab016780e8d..1d681230153 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -16,7 +16,7 @@ Currently, Cortex only implements a time series deletion API for chunk storage. This can be very important for users to have as confidential or accidental data might have been incorrectly pushed and needs to be removed. -## Related Works +## Related works As previously mentioned, the deletion feature is already implemented with chunk storage. The main functionality is implemented through the purger service. It accepts requests for deletion and processes them. At first, when a deletion request is made, a tombstone is created. This is used to filter out the data for queries. After some time, a deletion plan is executed where the data is permanently removed from chunk storage. @@ -27,7 +27,7 @@ Can find more info here: -## Background +## Background on current storage With a block-storage configuration, Cortex stores data that could be potentially deleted by a user in: @@ -111,7 +111,7 @@ From the purger service, the existing TombstonesLoader will periodically check f Similar to the chunk storage deletion implementation, the initial filtering of the deleted data will be done inside the Querier. This will allow filtering the data read from both the store gateway and the ingestor. This functionality already exists for the chunk storage implementation. By implementing it in the querier, this would mean that the ruler can also utilize this to filter out the various metrics for the alert manager (read from the store gateway). -#### Storing Tombstones in Object Store +#### Storing tombstones in object store The Purger will store the tombstone entries in a separate folder called “series_deletion” in the object store (e.g. 
S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as the current status and some meta-data such as the creation date of the request. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. @@ -124,7 +124,8 @@ The tombstone will be stored in a single file per request: The schema of the JSON file is: -```{ +``` +{ "requestId": , "startTime": , "endTime": , @@ -151,7 +152,7 @@ Cons: - Negative impact on query performance when there are active tombstones. As in the chunk storage implementation, all the queries made will have to be compared to the matchers contained in the active tombstone files. The impact on performance should be the same as the deletion would have with chunk storage. -#### Invalidating Cache +#### Invalidating cache Using block store, the different caches available are: - Index cache @@ -233,7 +234,7 @@ Once all the applicable blocks have been rewritten without the deleted data, the -##### Handling failed/unfinished delete jobs: +#### Handling failed/unfinished delete jobs: Deletions will be completed and the tombstones will be deleted only when the Purger iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep creating the markers indicating which blocks are remaining for deletion. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. The series deletion markers will remain in the bucket until the new blocks are created without the deleted data. Meaning that the compactor will continue to process the blocks for deletion that are remaining according to the deletion markers. From 9081eb28137b797cfd269dbca8c4f1d8fb517de6 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Mon, 14 Jun 2021 18:36:35 -0400 Subject: [PATCH 03/14] Implement requested changes to the proposal Signed-off-by: ilangofman --- .../block-storage-time-series-deletion.md | 113 +++++++----------- 1 file changed, 46 insertions(+), 67 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index 1d681230153..4f003afb8e3 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -14,7 +14,7 @@ slug: block-storage-time-series-deletion Currently, Cortex only implements a time series deletion API for chunk storage. We present a design for implementing time series deletion with block storage. We would like to have the same API for deleting series as currently implemented in Prometheus and in Cortex with chunk storage. -This can be very important for users to have as confidential or accidental data might have been incorrectly pushed and needs to be removed. +This can be very important for users to have as confidential or accidental data might have been incorrectly pushed and needs to be removed. As well as potentially removing high cardinality data that is causing ineffecient queries. ## Related works @@ -32,14 +32,14 @@ Can find more info here: With a block-storage configuration, Cortex stores data that could be potentially deleted by a user in: - Object store (GCS, S3, etc..) 
for long term storage of blocks -- Ingestors for more recent data that should be eventually transferred to the object store +- Ingesters for more recent data that should be eventually transferred to the object store - Cache - Index cache - Metadata cache - Chunks cache (stores the potentially to be deleted data) - Query results cache (stores the potentially to be deleted data) - - Compactor during the compaction process - - Store-gateway +- Compactor during the compaction process +- Store-gateway ## Proposal @@ -64,8 +64,7 @@ Parameters: - Cannot be empty, must contain at least one label matcher argument. -`POST /api/v1/admin/tsdb/cancel_delete_request` - To cancel a request if it has not been processed yet for permanent deletion. I.e. it is in the query filtering stage (see the deletion lifecycle below). - +`POST /api/v1/admin/tsdb/cancel_delete_request` - To cancel a request if it has not been processed yet for permanent deletion. This can only be done before the `-purger.delete-request-cancel-period` has passed. Parameters: - `request_id` @@ -76,49 +75,36 @@ Prometheus also implements a [clean_tombstones](https://prometheus.io/docs/prome ### Deletion Lifecycle -The deletion request lifecycle can follow the same states as currently implemented for chunk storage: - -1. StatusReceived - No actions are done on request yet, just doing query time filtering -2. StatusBuildingPlan - Request picked up for processing and building plans for it, still doing query time filtering -3. StatusDeleting - Plans built already, running delete operations and still doing query time filtering -4. StatusProcessed - All requested data deleted, not considering this for query time filtering - -(copied from the chunk storage series deletion proposal) - -With the current chunk store implementation, the amount of time for the request to move from StatusReceived to StatusBuildingPlan is dependent on a config option: `-purger.delete-request-cancel-period`. The purpose of this is to allow the user some time to cancel the deletion request if it was made by mistake. - - -#### Current entities in the Purger that can be reused - -The following entities already exist for the chunk store and can be adapted to work for block storage: +The deletion request lifecycle can follow these 3 states: -- `DeleteStore` - Handles storing and fetching delete requests, delete plans, cache gen numbers -- `TombstonesLoader` - Helps loading delete requests and cache gen. Also keeps checking for updates. -- `CacheGenMiddleware` - Adds generation number as a prefix to cache keys before doing Store/Fetch operations. - -(copied from the chunk storage series deletion proposal) +1. Received - Tombstone file is created, just doing query time filtering +2. Deleting - Running delete operations and still doing query time filtering +3. Syncing - All requested data deleted, and still doing query time filtering. Waiting for the bucket index and store-gateway to pick up the new blocks and replace the old chunks cache. +4. Processed - All requested data deleted, chunks cache should contain new blocks and no longer doing query time filtering. +The amount of time for the request to move from `Received` to `Deleting` is dependent on a config option: `-purger.delete-request-cancel-period`. The purpose of this is to allow the user some time to cancel the deletion request if it was made by mistake. 
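To make the lifecycle concrete, below is a minimal sketch of how the purger could represent these states and enforce the cancellation window; the type and function names are illustrative assumptions, not the actual Cortex code.

```go
package purger

import (
	"errors"
	"time"
)

// DeletionState mirrors the four request states described in this proposal.
type DeletionState string

const (
	StateReceived  DeletionState = "received"
	StateDeleting  DeletionState = "deleting"
	StateSyncing   DeletionState = "syncing"
	StateProcessed DeletionState = "processed"
)

// DeleteRequest holds the minimal fields needed to track a request's lifecycle.
type DeleteRequest struct {
	RequestID    string
	State        DeletionState
	CreationTime time.Time
}

// CanCancel reports whether a cancel_delete_request call is still allowed:
// only while the request is in the "received" state and the configured
// -purger.delete-request-cancel-period has not yet elapsed.
func CanCancel(req DeleteRequest, cancelPeriod time.Duration, now time.Time) error {
	if req.State != StateReceived {
		return errors.New("request already picked up for processing and cannot be cancelled")
	}
	if now.Sub(req.CreationTime) > cancelPeriod {
		return errors.New("the cancellation period for this request has passed")
	}
	return nil
}
```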
### Filtering data during queries while not yet deleted: -This will be done during the `DeleteRequest = StatusReceived` and `DeleteRequest = StatusBuildingPlan` parts of the deletion lifecycle. +This will be done during the first 3 parts of the deletion lifecycle until the tombstone is deleted and the request's status becomes `Processed`. Once a deletion request is received, a tombstone entry will be created. The object store such as S3, GCS, Azure storage, can be used to store all the deletion requests. See the section below for more detail on how the tombstones will be stored. Using the tombstones, the querier will be able to filter the to-be-deleted data initially. In addition, the existing cache will be invalidated using cache generation numbers, which are described in the later sections. -From the purger service, the existing TombstonesLoader will periodically check for new tombstones in the object-store and load the new requests for deletion periodically using the modified DeleteStore. They will be loaded in memory periodically instead of retrieving all the tombstones when performing queries. Currently, with chunk storage, the TombstonesLoader is configured to check for updates to the DeleteStore every 5 minutes. +The compactor will scan for new tombstone files and will update the bucket-index with the tombstone information regarding the deletion requests. This will enable the querier to priodically check the bucket index if there are any new tombstone files that are required for filtering. One drawback of this approach is the time it could take to start filtering the data. Since the compactor will update the bucket index with the new tombstones every `-compactor.cleanup-interval` (default 15 min). Then the cached bucket index is refreshed in the querier every `-blocks-storage.bucket-store.sync-interval` (default 15 min). Potentially could take almost 30 min for queriers to start filtering deleted data when using the default values. If the information requested for deletion is confidential/classified, the time delay is something that the user should be aware of, in addition to the time that the data has already been in Cortex. -Similar to the chunk storage deletion implementation, the initial filtering of the deleted data will be done inside the Querier. This will allow filtering the data read from both the store gateway and the ingestor. This functionality already exists for the chunk storage implementation. By implementing it in the querier, this would mean that the ruler can also utilize this to filter out the various metrics for the alert manager (read from the store gateway). +An additional thing to consider is that this would mean that the bucket-index would have to be enabled for this API to work. Since the plan is to make to the bucket-index mandatory in the future for block storage, this shouldn't be an issue. +Similar to the chunk storage deletion implementation, the initial filtering of the deleted data will be done inside the Querier. This will allow filtering the data read from both the store gateway and the ingester. This functionality already exists for the chunk storage implementation. By implementing it in the querier, this would mean that the ruler will be supported too (ruler internally runs the querier). #### Storing tombstones in object store -The Purger will store the tombstone entries in a separate folder called “series_deletion” in the object store (e.g. S3 bucket) in the respective tenant folder. 
Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as the current status and some meta-data such as the creation date of the request. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. +The Purger will store the tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the request. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 4 different file extensions will be `received, deleting, syncing, processed`. -The tombstone will be stored in a single file per request: +The tombstone will be stored in a single JSON file per request and state: -- `//series_deletion/requests/.json` +- `//tombstones/.json.` The schema of the JSON file is: @@ -137,7 +123,6 @@ The schema of the JSON file is: ] }, "userID": , - "deleteRequestStatus": } ``` @@ -149,7 +134,8 @@ Pros: - Allows deletion and un-delete to be done in a single operation. Cons: -- Negative impact on query performance when there are active tombstones. As in the chunk storage implementation, all the queries made will have to be compared to the matchers contained in the active tombstone files. The impact on performance should be the same as the deletion would have with chunk storage. + +- Negative impact on query performance when there are active tombstones. As in the chunk storage implementation, all the series will have to be compared to the matchers contained in the active tombstone files. The impact on performance should be the same as the deletion would have with chunk storage. #### Invalidating cache @@ -160,50 +146,31 @@ Using block store, the different caches available are: - Chunks cache (stores the potentially to be deleted chunks of data) - Query results cache (stores the potentially to be deleted data) -The filtering using tombstones is only requested by the querier. By using the purger, the querier filters out the data received from the ingestors and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. - - -Firstly, the query results cache needs to be invalidated for each new delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. 
The cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to store a per tenant JSON file with the corresponding cache generation numbers. - -Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. This is why it is also best to use cache generation numbers for this cache in order to invalidate it the same way. - -This file can be stored in: +Using the tombstones, the queriers filter out the data received from the ingesters and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. -- `//series_deletion/cache_generation_numbers.json` +Firstly, the query results cache needs to be invalidated for each new delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. The cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to store a per tenant key using the KV-store with the ring backend and propogate it using a Compare-And-Set/Swap (CAS) operation. If the current cache generation number is older than the KV-store is older or it is empty, then the cache is invalidated and the current timestamp becomes the cache generation number. -An example of the schema for the JSON file: - -``` -{ - "userID": , - "resultsCache": { - "generationNum": - }, - "storeCache": { - "generationNum": - } -} -``` +Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we add another state to the deletion process called `syncing`. The tombstones will need to continue filtering the data until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `syncing` state will begin as soon as all the requested data has been permentantly deleted from the block store. This state will last `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. The tombstone will move to the `processed` state and will no longer be used for query time filtering. ### Permanently deleting the data -To delete the data permanently from the block storage, deletion marker files will be used. 
The proposed approach is to perform the deletions from the compactor. A new background service inside the compactor called _DeletedSeriesCleaner_ can be created and is responsible for executing deletion. The process of permanently deleting the data can be separated into 2 stages, preprocessing and processing. - -#### Pre-processing - -This will happen after a grace period has passed once the API request has been made. By default, this should be 24 hours. The Purger will outline all the blocks that may contain data to be deleted. For each separate block that the deletion may be applicable to, the Purger will begin the process by adding a series deletion marker inside the series-deletion-marker.json file. The JSON file will contain an array of deletions that need to be applied to the block, which allows the ability to handle the situation when there are multiple tombstones that could be applicable to a particular block. +The proposed approach is to perform the deletions from the compactor. A new background service inside the compactor called _DeletedSeriesCleaner_ can be created and is responsible for executing the deletion. #### Processing -This will happen when the `DeleteRequest = StatusDeleting` in the deletion lifecycle. A background task can be created to process the permanent deletion of time series using the information inside the series-deletion-marker.json files. This can be done each hour. +This will happen after a grace period has passed once the API request has been made. By default this should be 24 hours. The state of the request becomes `Deleting`. A background task can be created to process the permanent deletion of time series. This background task can be executed each hour. To delete the data from the blocks, the same logic as the [Bucket Rewrite Tool](https://thanos.io/tip/components/tools.md/#bucket-rewrite ) from Thanos can be leveraged. This tool does the following: `tools bucket rewrite rewrites chosen blocks in the bucket, while deleting or modifying series`. The tool itself is a CLI tool that we won’t be using, but instead we can utilize the logic inside it. For more information about the way this tool runs, please see the code [here](https://github.com/thanos-io/thanos/blob/d8b21e708bee6d19f46ca32b158b0509ca9b7fed/cmd/thanos/tools_bucket.go#L809). The compactor’s _DeletedSeriesCleaner_ will apply this logic on individual blocks and each time it is run, it creates a new block without the data that matched the deletion request. The original individual blocks containing the data that was requested to be deleted, need to be marked for deletion by the compactor. -One important thing to note regarding this tool is that it should not be used at the same time as when another compactor is touching a block. If the tool is run at the same time as compaction on a particular block, it can cause overlap and the data marked for deletion can already be part of the compacted block. To mitigate such issues, these are some of the proposed solutions: +While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request id's that were used to create this block. Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request id(s) used for the rewrite operation inside this field. 
This will let the DeletedSeriesCleaner know that this block has already processed the particular deletions requests listed in this field. Assuming that the deletion requests are quite rare, the size of the meta.json files should remain small. + +The DeletedSeriesCleaner can iterate through all the blocks that the deletion request could apply to. If the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the DeletedSeriesCleaner will process both at the same time to prevent additional blocks from being created. For each deletion request, once all the applicable blocks contain a meta.json file with the deletion request ID inside the `tombstonesFiltered` field, then the `Deleting` state is complete. + +One important thing to note regarding this rewrite tool is that it should not be used at the same time as when another compactor is touching a block. If the tool is run at the same time as compaction on a particular block, it can cause overlap and the data marked for deletion can already be part of the compacted block. To mitigate such issues, these are some of the proposed solutions: Option 1: Only apply the deletion once the blocks are in the final state of compaction. @@ -230,7 +197,7 @@ In the newly created block without the deleted time series data, the information To determine when a deletion request is complete, the purger will iterate through all the applicable blocks that might have data to be deleted. If there are any blocks that don’t have the tombstone ID in the meta.json of the block indicating the deletion has been complete, then the purger will add the series deletion markers to those blocks (if it doesn’t already exist). If after iterating through all blocks, it doesn’t find any such blocks, then that means the compactor has finished executing. -Once all the applicable blocks have been rewritten without the deleted data, the deletion request moves to `DeleteRequest = StatusProcessed` and the tombstone is deleted. +Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to `Syncing`. Once the syncing time period is over, the state will advance to `Processed` state and the tombstone will no longer be used. @@ -241,19 +208,31 @@ Deletions will be completed and the tombstones will be deleted only when the Pur #### Tenant Deletion API -If a request is made to delete a tenant, then all the tombstones will be deleted for that user. For all the tombstones deleted, if there were any series deletion markers for the tombstones deleted, these will also need to be deleted prior to marking the tenant’s blocks for deletion. +If a request is made to delete a tenant, then all the tombstones will be deleted for that user. ## Current Open Questions: - If the start and end time is very far apart, it might result in a lot of the data being re-written. Since we create a new block without the deleted data and mark the old one for deletion, there may be a period of time with lots of extra blocks and space used for large deletion queries. -- Need to outline more clearly how this will work with multiple deletion requests at a time. +- There will be a delay between the deletion request and the deleted data being filtered during queires. + - In Prometheus, there is no delay. 
+ - One way to filter out Immediately is to load the tombstones during query time but this will cause a negative performance impact. +- Adding limits to the API such as: + - The number of deletion requests per day, + - Number of requests allowed at a time + - How wide apart the start and end time can be. + +## Alternatives Considered +#### Adding a Pre-processing State -## Alternatives Considered +The process of permanently deleting the data can be separated into 2 stages, preprocessing and processing. + +This will happen after a grace period has passed once the API request has been made. The deletion request will move to a new state called `BuildingPlan`. The compactor will outline all the blocks that may contain data to be deleted. For each separate block that the deletion may be applicable to, the compactor will begin the process by adding a series deletion marker inside the series-deletion-marker.json file. The JSON file will contain an array of deletion request id's that need to be applied to the block, which allows the ability to handle the situation when there are multiple tombstones that could be applicable to a particular block. Then during the processing step, instead of checking the meta.json file, we only need to check if a marker file exists with a specific deletion request id. If the marker file exists, then we apply the rewrite logic. +#### Alternative Permanent Deletion Processing For processing the actual deletions, an alternative approach is not to wait until the final compaction has been completed and filter out the data during compaction. If the data is marked to be deleted, then don’t include it the new bigger block during compaction. For the remaining blocks where the data wasn’t filtered during compaction, the deletion can be done the same as in the previous section. From 30ca52de2acaf864cddcb44fd85ff427cb4f5977 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Mon, 14 Jun 2021 18:53:00 -0400 Subject: [PATCH 04/14] Fix mention of Compactor instead of purger in proposal Signed-off-by: ilangofman --- docs/proposals/block-storage-time-series-deletion.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index 4f003afb8e3..a9cc4cce250 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -168,7 +168,7 @@ The compactor’s _DeletedSeriesCleaner_ will apply this logic on individual blo While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request id's that were used to create this block. Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request id(s) used for the rewrite operation inside this field. This will let the DeletedSeriesCleaner know that this block has already processed the particular deletions requests listed in this field. Assuming that the deletion requests are quite rare, the size of the meta.json files should remain small. -The DeletedSeriesCleaner can iterate through all the blocks that the deletion request could apply to. If the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. 
If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the DeletedSeriesCleaner will process both at the same time to prevent additional blocks from being created. For each deletion request, once all the applicable blocks contain a meta.json file with the deletion request ID inside the `tombstonesFiltered` field, then the `Deleting` state is complete. +The DeletedSeriesCleaner can iterate through all the blocks that the deletion request could apply to. For each of these block, if the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the DeletedSeriesCleaner will process both at the same time to prevent additional blocks from being created. If after iterating through all blocks, it doesn’t find any such blocks, then the `Deleting` state is complete. One important thing to note regarding this rewrite tool is that it should not be used at the same time as when another compactor is touching a block. If the tool is run at the same time as compaction on a particular block, it can cause overlap and the data marked for deletion can already be part of the compacted block. To mitigate such issues, these are some of the proposed solutions: @@ -193,9 +193,7 @@ Cons: - Added coupling between the compaction and the DeletedSeriesCleaner. - Might block compaction for a short time while doing the deletion. -In the newly created block without the deleted time series data, the information about the deletion is added to the meta.json file. This will indicate which deletion requests have been filtered out of this new block. This is necessary because it will let the Purger service know that this block doesn’t need to be rewritten again. -To determine when a deletion request is complete, the purger will iterate through all the applicable blocks that might have data to be deleted. If there are any blocks that don’t have the tombstone ID in the meta.json of the block indicating the deletion has been complete, then the purger will add the series deletion markers to those blocks (if it doesn’t already exist). If after iterating through all blocks, it doesn’t find any such blocks, then that means the compactor has finished executing. Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to `Syncing`. Once the syncing time period is over, the state will advance to `Processed` state and the tombstone will no longer be used. @@ -203,7 +201,7 @@ Once all the applicable blocks have been rewritten without the deleted data, the #### Handling failed/unfinished delete jobs: -Deletions will be completed and the tombstones will be deleted only when the Purger iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep creating the markers indicating which blocks are remaining for deletion. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. The series deletion markers will remain in the bucket until the new blocks are created without the deleted data. Meaning that the compactor will continue to process the blocks for deletion that are remaining according to the deletion markers. 
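As a rough illustration of the bookkeeping described above, the sketch below shows how the DeletedSeriesCleaner could decide which blocks still need a rewrite for a given request and when the `Deleting` state is finished; the names and the time-range overlap check are assumptions for illustration, not the actual Cortex implementation.

```go
package compactor

import "context"

// blockMeta is a trimmed-down view of a block's meta.json, including the
// proposed tombstonesFiltered field.
type blockMeta struct {
	ID                 string
	MinTime, MaxTime   int64    // block time range (milliseconds since epoch)
	TombstonesFiltered []string // deletion request IDs already rewritten out of this block
}

type deleteRequest struct {
	RequestID          string
	StartTime, EndTime int64
}

func containsID(ids []string, id string) bool {
	for _, v := range ids {
		if v == id {
			return true
		}
	}
	return false
}

// processDeleteRequest rewrites every overlapping block that has not yet been
// processed for this request. It returns true once no such block remains,
// i.e. the request can leave the `Deleting` state.
func processDeleteRequest(ctx context.Context, req deleteRequest, blocks []blockMeta,
	rewrite func(ctx context.Context, b blockMeta, req deleteRequest) error) (bool, error) {

	done := true
	for _, b := range blocks {
		overlaps := b.MinTime <= req.EndTime && b.MaxTime >= req.StartTime
		if !overlaps || containsID(b.TombstonesFiltered, req.RequestID) {
			continue // block not affected, or already rewritten for this request
		}
		done = false
		// Rewrite the block without the matching series (Thanos bucket rewrite
		// logic); the new block's meta.json appends req.RequestID to
		// tombstonesFiltered and the original block is marked for deletion.
		if err := rewrite(ctx, b, req); err != nil {
			return false, err // unfinished work is simply retried on the next run
		}
	}
	return done, nil
}
```

If the cleaner crashes partway through, re-running the same loop is safe: blocks already rewritten are skipped via `tombstonesFiltered`, and the remaining ones are picked up again, which matches the failure-handling behaviour described above.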
+Deletions will be completed and the tombstones will be deleted only when the DeletedSeriesCleaner iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep creating the markers indicating which blocks are remaining for deletion. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. The series deletion markers will remain in the bucket until the new blocks are created without the deleted data. Meaning that the compactor will continue to process the blocks for deletion that are remaining according to the deletion markers. #### Tenant Deletion API From 47f06d4e2440f6dc3d6136b66acdd59541c1d1eb Mon Sep 17 00:00:00 2001 From: ilangofman Date: Mon, 14 Jun 2021 23:33:58 -0400 Subject: [PATCH 05/14] Fixed wording and spelling in proposal Signed-off-by: ilangofman --- .../block-storage-time-series-deletion.md | 26 +++++++++---------- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index a9cc4cce250..aaf59715765 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -14,7 +14,7 @@ slug: block-storage-time-series-deletion Currently, Cortex only implements a time series deletion API for chunk storage. We present a design for implementing time series deletion with block storage. We would like to have the same API for deleting series as currently implemented in Prometheus and in Cortex with chunk storage. -This can be very important for users to have as confidential or accidental data might have been incorrectly pushed and needs to be removed. As well as potentially removing high cardinality data that is causing ineffecient queries. +This can be very important for users to have as confidential or accidental data might have been incorrectly pushed and needs to be removed. As well as potentially removing high cardinality data that is causing inefficient queries. ## Related works @@ -87,11 +87,11 @@ The amount of time for the request to move from `Received` to `Deleting` is depe ### Filtering data during queries while not yet deleted: -This will be done during the first 3 parts of the deletion lifecycle until the tombstone is deleted and the request's status becomes `Processed`. +This will be done during the first 3 states of the deletion lifecycle until the tombstone is deleted and the request's status becomes `Processed`. Once a deletion request is received, a tombstone entry will be created. The object store such as S3, GCS, Azure storage, can be used to store all the deletion requests. See the section below for more detail on how the tombstones will be stored. Using the tombstones, the querier will be able to filter the to-be-deleted data initially. In addition, the existing cache will be invalidated using cache generation numbers, which are described in the later sections. -The compactor will scan for new tombstone files and will update the bucket-index with the tombstone information regarding the deletion requests. This will enable the querier to priodically check the bucket index if there are any new tombstone files that are required for filtering. One drawback of this approach is the time it could take to start filtering the data. 
Since the compactor will update the bucket index with the new tombstones every `-compactor.cleanup-interval` (default 15 min). Then the cached bucket index is refreshed in the querier every `-blocks-storage.bucket-store.sync-interval` (default 15 min). Potentially could take almost 30 min for queriers to start filtering deleted data when using the default values. If the information requested for deletion is confidential/classified, the time delay is something that the user should be aware of, in addition to the time that the data has already been in Cortex. +The compactor will scan for new tombstone files and will update the bucket-index with the tombstone information regarding the deletion requests. This will enable the querier to periodically check the bucket index if there are any new tombstone files that are required to be used for filtering. One drawback of this approach is the time it could take to start filtering the data. Since the compactor will update the bucket index with the new tombstones every `-compactor.cleanup-interval` (default 15 min). Then the cached bucket index is refreshed in the querier every `-blocks-storage.bucket-store.sync-interval` (default 15 min). Potentially could take almost 30 min for queriers to start filtering deleted data when using the default values. If the information requested for deletion is confidential/classified, the time delay is something that the user should be aware of, in addition to the time that the data has already been in Cortex. An additional thing to consider is that this would mean that the bucket-index would have to be enabled for this API to work. Since the plan is to make to the bucket-index mandatory in the future for block storage, this shouldn't be an issue. @@ -100,7 +100,7 @@ Similar to the chunk storage deletion implementation, the initial filtering of t #### Storing tombstones in object store -The Purger will store the tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the request. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 4 different file extensions will be `received, deleting, syncing, processed`. +The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 4 different file extensions will be `received, deleting, syncing, processed`. 
Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. The tombstone will be stored in a single JSON file per request and state: @@ -128,15 +128,13 @@ The schema of the JSON file is: Pros: - -- Design is similar to the existing chunk storage deletion - - Lots of code can be reused inside the purger component. - Allows deletion and un-delete to be done in a single operation. Cons: - Negative impact on query performance when there are active tombstones. As in the chunk storage implementation, all the series will have to be compared to the matchers contained in the active tombstone files. The impact on performance should be the same as the deletion would have with chunk storage. +- Potential 30 minute wait for the data to begin filtering if using the default configuration. #### Invalidating cache @@ -146,11 +144,11 @@ Using block store, the different caches available are: - Chunks cache (stores the potentially to be deleted chunks of data) - Query results cache (stores the potentially to be deleted data) -Using the tombstones, the queriers filter out the data received from the ingesters and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. +There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. Using the tombstones, the queriers filter out the data received from the ingesters and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. -Firstly, the query results cache needs to be invalidated for each new delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. The cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to store a per tenant key using the KV-store with the ring backend and propogate it using a Compare-And-Set/Swap (CAS) operation. If the current cache generation number is older than the KV-store is older or it is empty, then the cache is invalidated and the current timestamp becomes the cache generation number. +Firstly, the query results cache needs to be invalidated for each new delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. 
With chunk store, the cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to save a per tenant key using the KV-store with the ring backend and propagate it using a Compare-And-Set/Swap (CAS) operation. If the current cache generation number is older than the KV-store is older or it is empty, then the cache is invalidated and the current timestamp becomes the cache generation number. -Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we add another state to the deletion process called `syncing`. The tombstones will need to continue filtering the data until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `syncing` state will begin as soon as all the requested data has been permentantly deleted from the block store. This state will last `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. The tombstone will move to the `processed` state and will no longer be used for query time filtering. +Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we add another state to the deletion process called `syncing`. The tombstones will need to continue filtering the data until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `syncing` state will begin as soon as all the requested data has been permanently deleted from the block store. This state will last `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. Then the tombstone will move to the `processed` state and will no longer be used for query time filtering. ### Permanently deleting the data @@ -166,9 +164,9 @@ To delete the data from the blocks, the same logic as the [Bucket Rewrite Tool]( The compactor’s _DeletedSeriesCleaner_ will apply this logic on individual blocks and each time it is run, it creates a new block without the data that matched the deletion request. The original individual blocks containing the data that was requested to be deleted, need to be marked for deletion by the compactor. -While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request id's that were used to create this block. 
Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request id(s) used for the rewrite operation inside this field. This will let the DeletedSeriesCleaner know that this block has already processed the particular deletions requests listed in this field. Assuming that the deletion requests are quite rare, the size of the meta.json files should remain small. +While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request id's that were used to create this block. Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request id(s) used for the rewrite operation inside this field. This will let the _DeletedSeriesCleaner_ know that this block has already processed the particular deletions requests listed in this field. Assuming that the deletion requests are quite rare, the size of the meta.json files should remain small. -The DeletedSeriesCleaner can iterate through all the blocks that the deletion request could apply to. For each of these block, if the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the DeletedSeriesCleaner will process both at the same time to prevent additional blocks from being created. If after iterating through all blocks, it doesn’t find any such blocks, then the `Deleting` state is complete. +The _DeletedSeriesCleaner_ can iterate through all the blocks that the deletion request could apply to. For each of these blocks, if the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the _DeletedSeriesCleaner_ will process both at the same time to prevent additional blocks from being created. If after iterating through all the blocks, it doesn’t find any such blocks requiring deletion, then the `Deleting` state is complete and the request progresses to the `Syncing` state. One important thing to note regarding this rewrite tool is that it should not be used at the same time as when another compactor is touching a block. If the tool is run at the same time as compaction on a particular block, it can cause overlap and the data marked for deletion can already be part of the compacted block. To mitigate such issues, these are some of the proposed solutions: @@ -213,7 +211,7 @@ If a request is made to delete a tenant, then all the tombstones will be deleted ## Current Open Questions: - If the start and end time is very far apart, it might result in a lot of the data being re-written. Since we create a new block without the deleted data and mark the old one for deletion, there may be a period of time with lots of extra blocks and space used for large deletion queries. -- There will be a delay between the deletion request and the deleted data being filtered during queires. +- There will be a delay between the deletion request and the deleted data being filtered during queries. - In Prometheus, there is no delay. 
- One way to filter out Immediately is to load the tombstones during query time but this will cause a negative performance impact. - Adding limits to the API such as: @@ -228,7 +226,7 @@ If a request is made to delete a tenant, then all the tombstones will be deleted The process of permanently deleting the data can be separated into 2 stages, preprocessing and processing. -This will happen after a grace period has passed once the API request has been made. The deletion request will move to a new state called `BuildingPlan`. The compactor will outline all the blocks that may contain data to be deleted. For each separate block that the deletion may be applicable to, the compactor will begin the process by adding a series deletion marker inside the series-deletion-marker.json file. The JSON file will contain an array of deletion request id's that need to be applied to the block, which allows the ability to handle the situation when there are multiple tombstones that could be applicable to a particular block. Then during the processing step, instead of checking the meta.json file, we only need to check if a marker file exists with a specific deletion request id. If the marker file exists, then we apply the rewrite logic. +Pre-processing will begin after the `-purger.delete-request-cancel-period` has passed since the API request has been made. The deletion request will move to a new state called `BuildingPlan`. The compactor will outline all the blocks that may contain data to be deleted. For each separate block that the deletion may be applicable to, the compactor will begin the process by adding a series deletion marker inside the series-deletion-marker.json file. The JSON file will contain an array of deletion request id's that need to be applied to the block, which allows the ability to handle the situation when there are multiple tombstones that could be applicable to a particular block. Then during the processing step, instead of checking the meta.json file, we only need to check if a marker file exists with a specific deletion request id. If the marker file exists, then we apply the rewrite logic. #### Alternative Permanent Deletion Processing From 0224010767b15c1a2aea6e150dc2de9d9b70ce40 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Thu, 24 Jun 2021 00:29:14 -0400 Subject: [PATCH 06/14] Update the cache invalidation method Signed-off-by: ilangofman --- .../block-storage-time-series-deletion.md | 36 ++++++++++--------- 1 file changed, 20 insertions(+), 16 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index aaf59715765..1b08e2bc3f3 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -77,19 +77,16 @@ Prometheus also implements a [clean_tombstones](https://prometheus.io/docs/prome The deletion request lifecycle can follow these 3 states: -1. Received - Tombstone file is created, just doing query time filtering -2. Deleting - Running delete operations and still doing query time filtering -3. Syncing - All requested data deleted, and still doing query time filtering. Waiting for the bucket index and store-gateway to pick up the new blocks and replace the old chunks cache. -4. Processed - All requested data deleted, chunks cache should contain new blocks and no longer doing query time filtering. +1. Pending - Tombstone file is created, just doing query time filtering. No data has been deleted yet. +2. 
Deleting - Running delete operations and still doing query time filtering. +4. Processed - All requested data deleted, chunks cache should contain new blocks. Initially, will still need to do query time filtering while waiting for the bucket index and store-gateway to pick up the new blocks and replace the old chunks cache. Once that period has passed, will no longer require any query time filtering. -The amount of time for the request to move from `Received` to `Deleting` is dependent on a config option: `-purger.delete-request-cancel-period`. The purpose of this is to allow the user some time to cancel the deletion request if it was made by mistake. +The amount of time for the request to move from `Pending` to `Deleting` is dependent on a config option: `-purger.delete-request-cancel-period`. The purpose of this is to allow the user some time to cancel the deletion request if it was made by mistake. ### Filtering data during queries while not yet deleted: -This will be done during the first 3 states of the deletion lifecycle until the tombstone is deleted and the request's status becomes `Processed`. - -Once a deletion request is received, a tombstone entry will be created. The object store such as S3, GCS, Azure storage, can be used to store all the deletion requests. See the section below for more detail on how the tombstones will be stored. Using the tombstones, the querier will be able to filter the to-be-deleted data initially. In addition, the existing cache will be invalidated using cache generation numbers, which are described in the later sections. +Once a deletion request is received, a tombstone entry will be created. The object store such as S3, GCS, Azure storage, can be used to store all the deletion requests. See the section below for more detail on how the tombstones will be stored. Using the tombstones, the querier will be able to filter the to-be-deleted data initially. If a cancel delete request is made, then the tombstone file will be deleted. In addition, the existing cache will be invalidated using cache generation numbers, which are described in the later sections. The compactor will scan for new tombstone files and will update the bucket-index with the tombstone information regarding the deletion requests. This will enable the querier to periodically check the bucket index if there are any new tombstone files that are required to be used for filtering. One drawback of this approach is the time it could take to start filtering the data. Since the compactor will update the bucket index with the new tombstones every `-compactor.cleanup-interval` (default 15 min). Then the cached bucket index is refreshed in the querier every `-blocks-storage.bucket-store.sync-interval` (default 15 min). Potentially could take almost 30 min for queriers to start filtering deleted data when using the default values. If the information requested for deletion is confidential/classified, the time delay is something that the user should be aware of, in addition to the time that the data has already been in Cortex. @@ -100,7 +97,9 @@ Similar to the chunk storage deletion implementation, the initial filtering of t #### Storing tombstones in object store -The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. 
Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 4 different file extensions will be `received, deleting, syncing, processed`. Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. +The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending, deleting, processed`. Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. + +Updating the states will be done from the compactor. Inside the new _DeletedSeriesCleaner_ service, it will periodically check all the tombstones to see if their current state is ready to be upgraded. If it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information. The information inside the file will be the same except the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When writing to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. The tombstone will be stored in a single JSON file per request and state: @@ -134,7 +133,7 @@ Cons: - Negative impact on query performance when there are active tombstones. As in the chunk storage implementation, all the series will have to be compared to the matchers contained in the active tombstone files. The impact on performance should be the same as the deletion would have with chunk storage. 
-- Potential 30 minute wait for the data to begin filtering if using the default configuration. +- With the default config, potential 30 minute wait for the data to begin filtering if using the default configuration. #### Invalidating cache @@ -146,9 +145,14 @@ Using block store, the different caches available are: There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. Using the tombstones, the queriers filter out the data received from the ingesters and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. -Firstly, the query results cache needs to be invalidated for each new delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. With chunk store, the cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to save a per tenant key using the KV-store with the ring backend and propagate it using a Compare-And-Set/Swap (CAS) operation. If the current cache generation number is older than the KV-store is older or it is empty, then the cache is invalidated and the current timestamp becomes the cache generation number. +Firstly, the query results cache needs to be invalidated for each new delete request or a cancel delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. With chunk store, the cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to save a per tenant key using the KV-store with the ring backend and propagate it using a Compare-And-Set/Swap (CAS) operation. If the current cache generation number is older than the KV-store is older or it is empty, then the cache is invalidated and the current timestamp becomes the cache generation number. However, since the tombstones are uploaded asynchronously to the queriers, the results from the queriers might contain deleted data after the cache has been already invalidated. That would mean that the results can't be cached until there is a guarantee that the querier's responses have been processed using the most up to date tombstones. Here is the proposed method for ensuring this: + +- When a deletion request is made or cancelled, the cache generation number is incremented for the given tenant to the current timestamp. +- Inside the query frontend, the cache generation number will be outdated, so the cache is invalidated. 
+- When the compactor writes the tombstones to the bucket index, it will include the timestamp of when the write occurred. When the querier reads from the bucket index, it will store this timestamp. +- Inside the query frontend, if a response is not found in cache, it will ask at least one querier. Inside the response of each querier, it will include the timestamp of when the compactor wrote to the bucket index. The query frontend will compare the minimum timestamp that was returned by the queriers to the cache generation number (timestamp). If the minimum timestamp is larger than the cache generation number, that means that all the queriers loaded the bucket index after the tombstone have been created/deleted. If this is the case, then the response from each of the queriers should have the correct data and the results can be cached inside the query frontend. Otherwise, the response of the queriers might contain data that was not processed using the most up-to-date tombstones, and the results will not be cached. Using the minimum timestamp returned by the queriers, we can guarantee that the queriers have the most recent tombstones. -Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we add another state to the deletion process called `syncing`. The tombstones will need to continue filtering the data until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `syncing` state will begin as soon as all the requested data has been permanently deleted from the block store. This state will last `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. Then the tombstone will move to the `processed` state and will no longer be used for query time filtering. +Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we keep performing query time filtering using the tombstones once all the data has been deleted for an additional period of time. The filtering would be required until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `processed` state will begin as soon as all the requested data has been permanently deleted from the block store. Once in this state, the query time filtering will last for a length of `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. Then the tombstone will no longer be used for query time filtering. 
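As a rough illustration of the caching rule outlined in the list above, the following is a minimal Go sketch of the comparison the query frontend could perform. The type and function names are hypothetical, not the actual Cortex middleware API; the sketch only demonstrates the idea of comparing the minimum bucket-index timestamp reported by the queriers against the tenant's cache generation number.

```go
package main

import (
	"fmt"
	"time"
)

// QuerierResponseMeta is a hypothetical stand-in for the metadata each querier
// would return alongside its result. BucketIndexUpdatedAt is the timestamp the
// compactor wrote into the bucket index when it last synced the tombstones.
type QuerierResponseMeta struct {
	BucketIndexUpdatedAt time.Time
}

// resultsCacheable sketches the decision the query frontend could make: a
// response may only be cached if every querier that contributed to it loaded a
// bucket index written after the tenant's cache generation number (a timestamp
// bumped on every delete or cancel-delete request).
func resultsCacheable(cacheGenNumber time.Time, responses []QuerierResponseMeta) bool {
	if len(responses) == 0 {
		return false
	}
	minUpdated := responses[0].BucketIndexUpdatedAt
	for _, r := range responses[1:] {
		if r.BucketIndexUpdatedAt.Before(minUpdated) {
			minUpdated = r.BucketIndexUpdatedAt
		}
	}
	// The slowest querier must have seen the tombstone set that triggered the
	// cache invalidation, otherwise its result may still contain deleted (or
	// un-deleted) series and must not be cached.
	return minUpdated.After(cacheGenNumber)
}

func main() {
	// Cache generation number bumped 10 minutes ago by a delete request.
	cacheGen := time.Now().Add(-10 * time.Minute)
	responses := []QuerierResponseMeta{
		{BucketIndexUpdatedAt: time.Now().Add(-5 * time.Minute)},
		{BucketIndexUpdatedAt: time.Now().Add(-2 * time.Minute)},
	}
	// Both queriers read a bucket index newer than the cache generation
	// number, so this response could be cached.
	fmt.Println("cacheable:", resultsCacheable(cacheGen, responses))
}
```

With a rule like this, a response is only written to the results cache once every querier involved has loaded a bucket index that the compactor wrote after the latest delete or cancel-delete request for the tenant.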
### Permanently deleting the data @@ -166,7 +170,7 @@ The compactor’s _DeletedSeriesCleaner_ will apply this logic on individual blo While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request id's that were used to create this block. Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request id(s) used for the rewrite operation inside this field. This will let the _DeletedSeriesCleaner_ know that this block has already processed the particular deletions requests listed in this field. Assuming that the deletion requests are quite rare, the size of the meta.json files should remain small. -The _DeletedSeriesCleaner_ can iterate through all the blocks that the deletion request could apply to. For each of these blocks, if the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the _DeletedSeriesCleaner_ will process both at the same time to prevent additional blocks from being created. If after iterating through all the blocks, it doesn’t find any such blocks requiring deletion, then the `Deleting` state is complete and the request progresses to the `Syncing` state. +The _DeletedSeriesCleaner_ can iterate through all the blocks that the deletion request could apply to. For each of these blocks, if the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the _DeletedSeriesCleaner_ will process both at the same time to prevent additional blocks from being created. If after iterating through all the blocks, it doesn’t find any such blocks requiring deletion, then the `Deleting` state is complete and the request progresses to the `Processed` state. One important thing to note regarding this rewrite tool is that it should not be used at the same time as when another compactor is touching a block. If the tool is run at the same time as compaction on a particular block, it can cause overlap and the data marked for deletion can already be part of the compacted block. To mitigate such issues, these are some of the proposed solutions: @@ -193,13 +197,13 @@ Cons: -Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to `Syncing`. Once the syncing time period is over, the state will advance to `Processed` state and the tombstone will no longer be used. +Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to `Processed`. Once in this state, the queriers will still have to perform query time filtering using the tombstones for a period of time. This period of time is described in the cache invalidation section. Once the time period is over, the tombstone will no longer be used. #### Handling failed/unfinished delete jobs: -Deletions will be completed and the tombstones will be deleted only when the DeletedSeriesCleaner iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. 
Otherwise, it will keep creating the markers indicating which blocks are remaining for deletion. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. The series deletion markers will remain in the bucket until the new blocks are created without the deleted data. Meaning that the compactor will continue to process the blocks for deletion that are remaining according to the deletion markers. +Deletions will be completed and the tombstones will be deleted only when the DeletedSeriesCleaner iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep iterating over the blocks and process the blocks that haven't been rewritten according to the information in the `meta.json` file. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. If the block rewrite was not completed on a particular block, then the block will not be marked for deletion. The compactor will continue to iterate over the blocks and process the block again. #### Tenant Deletion API @@ -213,7 +217,7 @@ If a request is made to delete a tenant, then all the tombstones will be deleted - If the start and end time is very far apart, it might result in a lot of the data being re-written. Since we create a new block without the deleted data and mark the old one for deletion, there may be a period of time with lots of extra blocks and space used for large deletion queries. - There will be a delay between the deletion request and the deleted data being filtered during queries. - In Prometheus, there is no delay. - - One way to filter out Immediately is to load the tombstones during query time but this will cause a negative performance impact. + - One way to filter out immediately is to load the tombstones during query time but this will cause a negative performance impact. - Adding limits to the API such as: - The number of deletion requests per day, - Number of requests allowed at a time From 73db5428681f97ac7372a7902aca76d9e9f64e24 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Thu, 24 Jun 2021 00:55:01 -0400 Subject: [PATCH 07/14] Fix wording on cache invalidation section Signed-off-by: ilangofman --- docs/proposals/block-storage-time-series-deletion.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index 1b08e2bc3f3..752f0c7468a 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -145,12 +145,12 @@ Using block store, the different caches available are: There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. Using the tombstones, the queriers filter out the data received from the ingesters and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. -Firstly, the query results cache needs to be invalidated for each new delete request or a cancel delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. 
When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. With chunk store, the cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to save a per tenant key using the KV-store with the ring backend and propagate it using a Compare-And-Set/Swap (CAS) operation. If the current cache generation number is older than the KV-store is older or it is empty, then the cache is invalidated and the current timestamp becomes the cache generation number. However, since the tombstones are uploaded asynchronously to the queriers, the results from the queriers might contain deleted data after the cache has been already invalidated. That would mean that the results can't be cached until there is a guarantee that the querier's responses have been processed using the most up to date tombstones. Here is the proposed method for ensuring this: +Firstly, the query results cache needs to be invalidated for each new delete request or a cancel delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. With chunk store, the cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to save a per tenant key using the KV-store with the ring backend and propagate it using a Compare-And-Set/Swap (CAS) operation. Inside the query frontend, if the current cache generation number is older than the one in KV-store or it is empty, then the cache is invalidated and the cache generation number is updated. However, since the tombstones are uploaded asynchronously to the queriers, the results from the queriers might contain deleted data after the cache has been already invalidated. That would mean that the results can't be cached until there is a guarantee that the querier's responses have been processed using the most up to date tombstones. Here is the proposed method for ensuring this: -- When a deletion request is made or cancelled, the cache generation number is incremented for the given tenant to the current timestamp. -- Inside the query frontend, the cache generation number will be outdated, so the cache is invalidated. +- When a deletion request is made or cancelled, the cache generation number is incremented for the given tenant in the KV-store to the current timestamp. +- Inside the query frontend, the cache generation number will be outdated, so the cache is invalidated. The query frontend will now store the most recent cache generation number. - When the compactor writes the tombstones to the bucket index, it will include the timestamp of when the write occurred. When the querier reads from the bucket index, it will store this timestamp. -- Inside the query frontend, if a response is not found in cache, it will ask at least one querier. Inside the response of each querier, it will include the timestamp of when the compactor wrote to the bucket index. 
The query frontend will compare the minimum timestamp that was returned by the queriers to the cache generation number (timestamp). If the minimum timestamp is larger than the cache generation number, that means that all the queriers loaded the bucket index after the tombstone have been created/deleted. If this is the case, then the response from each of the queriers should have the correct data and the results can be cached inside the query frontend. Otherwise, the response of the queriers might contain data that was not processed using the most up-to-date tombstones, and the results will not be cached. Using the minimum timestamp returned by the queriers, we can guarantee that the queriers have the most recent tombstones. +- Inside the query frontend, if a response is not found in cache, it will ask at least one querier. Inside the response of each querier, it will include the timestamp of when the compactor wrote to the bucket index. The query frontend will compare the minimum timestamp that was returned by the queriers to the cache generation number (timestamp). If the minimum timestamp is larger than the cache generation number, that means that all the queriers loaded the bucket index after the compactor has updated it with the most up-to-date tombstones. If this is the case, then the response from each of the queriers should have the correct data and the results can be cached inside the query frontend. Otherwise, the response of the queriers might contain data that was not processed using the most up-to-date tombstones, and the results should not be cached. Using the minimum timestamp returned by the queriers, we can guarantee that the queriers have the most recent tombstones. Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we keep performing query time filtering using the tombstones once all the data has been deleted for an additional period of time. The filtering would be required until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `processed` state will begin as soon as all the requested data has been permanently deleted from the block store. Once in this state, the query time filtering will last for a length of `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. Then the tombstone will no longer be used for query time filtering. 
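To make the query time filtering concrete, below is a small, self-contained Go sketch of the check a querier could apply to each sample before returning results. It uses a simplified equality-only matcher as a stand-in; the real implementation would reuse the Prometheus label matchers and the tombstone format described earlier in this proposal, so the names and structure here are illustrative assumptions only.

```go
package main

import (
	"fmt"
	"time"
)

// Matcher and Tombstone are simplified stand-ins for the real types; the
// actual querier would reuse Prometheus label matchers and the tombstone
// schema stored in the object store.
type Matcher struct {
	Name  string
	Value string // equality matching only, for illustration
}

type Tombstone struct {
	StartTime time.Time
	EndTime   time.Time
	Matchers  []Matcher
}

// filteredOut reports whether a sample with the given labels and timestamp
// falls inside an active tombstone and therefore must be hidden from the
// query result while the permanent deletion has not completed yet.
func filteredOut(lbls map[string]string, ts time.Time, tombstones []Tombstone) bool {
	for _, t := range tombstones {
		if ts.Before(t.StartTime) || ts.After(t.EndTime) {
			continue
		}
		matchesAll := true
		for _, m := range t.Matchers {
			if lbls[m.Name] != m.Value {
				matchesAll = false
				break
			}
		}
		if matchesAll {
			return true
		}
	}
	return false
}

func main() {
	tombstones := []Tombstone{{
		StartTime: time.Now().Add(-24 * time.Hour),
		EndTime:   time.Now(),
		Matchers:  []Matcher{{Name: "job", Value: "confidential-app"}},
	}}
	sample := map[string]string{"__name__": "http_requests_total", "job": "confidential-app"}
	// The sample matches the tombstone's matchers and falls inside its time
	// range, so it is dropped from the response.
	fmt.Println("filtered:", filteredOut(sample, time.Now().Add(-time.Hour), tombstones))
}
```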
From ffd079e77273c6138df438d5de2f5163ebef0285 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Thu, 24 Jun 2021 10:20:53 -0400 Subject: [PATCH 08/14] Minor wording additions Signed-off-by: ilangofman --- docs/proposals/block-storage-time-series-deletion.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index 752f0c7468a..f3120210757 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -99,7 +99,7 @@ Similar to the chunk storage deletion implementation, the initial filtering of t The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending, deleting, processed`. Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. -Updating the states will be done from the compactor. Inside the new _DeletedSeriesCleaner_ service, it will periodically check all the tombstones to see if their current state is ready to be upgraded. If it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information. The information inside the file will be the same except the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When writing to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. +Updating the states will be done from the compactor. Inside the compactor, the new _DeletedSeriesCleaner_ service will periodically check all the tombstones to see if their current state is ready to be upgraded. If it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information. The information inside the file will be the same except the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. 
If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When writing to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. The tombstone will be stored in a single JSON file per request and state: @@ -203,7 +203,7 @@ Once all the applicable blocks have been rewritten without the deleted data, the #### Handling failed/unfinished delete jobs: -Deletions will be completed and the tombstones will be deleted only when the DeletedSeriesCleaner iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep iterating over the blocks and process the blocks that haven't been rewritten according to the information in the `meta.json` file. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. If the block rewrite was not completed on a particular block, then the block will not be marked for deletion. The compactor will continue to iterate over the blocks and process the block again. +Deletions will be completed and the tombstones will be deleted only when the DeletedSeriesCleaner iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep iterating over the blocks and process the blocks that haven't been rewritten according to the information in the `meta.json` file. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. If the block rewrite was not completed on a particular block, then the original block will not be marked for deletion. The compactor will continue to iterate over the blocks and process the block again. #### Tenant Deletion API From 2f30ab57f71d1e3db944db457bfa037f4613e611 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Wed, 30 Jun 2021 22:22:30 -0400 Subject: [PATCH 09/14] Remove white-noise from text Signed-off-by: ilangofman --- .../block-storage-time-series-deletion.md | 118 +++++++++--------- 1 file changed, 59 insertions(+), 59 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index f3120210757..fcabe043205 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -11,14 +11,14 @@ slug: block-storage-time-series-deletion ## Problem -Currently, Cortex only implements a time series deletion API for chunk storage. We present a design for implementing time series deletion with block storage. We would like to have the same API for deleting series as currently implemented in Prometheus and in Cortex with chunk storage. +Currently, Cortex only implements a time series deletion API for chunk storage. We present a design for implementing time series deletion with block storage. 
We would like to have the same API for deleting series as currently implemented in Prometheus and in Cortex with chunk storage. -This can be very important for users to have as confidential or accidental data might have been incorrectly pushed and needs to be removed. As well as potentially removing high cardinality data that is causing inefficient queries. +This can be very important for users to have as confidential or accidental data might have been incorrectly pushed and needs to be removed. As well as potentially removing high cardinality data that is causing inefficient queries. ## Related works -As previously mentioned, the deletion feature is already implemented with chunk storage. The main functionality is implemented through the purger service. It accepts requests for deletion and processes them. At first, when a deletion request is made, a tombstone is created. This is used to filter out the data for queries. After some time, a deletion plan is executed where the data is permanently removed from chunk storage. +As previously mentioned, the deletion feature is already implemented with chunk storage. The main functionality is implemented through the purger service. It accepts requests for deletion and processes them. At first, when a deletion request is made, a tombstone is created. This is used to filter out the data for queries. After some time, a deletion plan is executed where the data is permanently removed from chunk storage. Can find more info here: @@ -33,22 +33,22 @@ With a block-storage configuration, Cortex stores data that could be potentially - Object store (GCS, S3, etc..) for long term storage of blocks - Ingesters for more recent data that should be eventually transferred to the object store -- Cache +- Cache - Index cache - Metadata cache - - Chunks cache (stores the potentially to be deleted data) - - Query results cache (stores the potentially to be deleted data) + - Chunks cache (stores the potentially to be deleted data) + - Query results cache (stores the potentially to be deleted data) - Compactor during the compaction process - Store-gateway ## Proposal -The deletion will not happen right away. Initially, the data will be filtered out from queries using tombstones and will be deleted afterward. This will allow the user some time to cancel the delete request. +The deletion will not happen right away. Initially, the data will be filtered out from queries using tombstones and will be deleted afterward. This will allow the user some time to cancel the delete request. ### API Endpoints -The existing purger service will be used to process the incoming requests for deletion. The API will follow the same structure as the chunk storage endpoints for deletion, which is also based on the Prometheus deletion API. +The existing purger service will be used to process the incoming requests for deletion. The API will follow the same structure as the chunk storage endpoints for deletion, which is also based on the Prometheus deletion API. This will enable the following endpoints for Cortex when using block storage: @@ -57,51 +57,51 @@ This will enable the following endpoints for Cortex when using block storage: Parameters: - `start=` - - Optional. If not provided, will be set to minimum possible time. + - Optional. If not provided, will be set to minimum possible time. - `end= ` - - Optional. If not provided, will be set to maximum possible time (time when request was made). End time cannot be greater than the current UTC time. + - Optional. 
If not provided, will be set to maximum possible time (time when request was made). End time cannot be greater than the current UTC time. - `match[]=` - - Cannot be empty, must contain at least one label matcher argument. + - Cannot be empty, must contain at least one label matcher argument. -`POST /api/v1/admin/tsdb/cancel_delete_request` - To cancel a request if it has not been processed yet for permanent deletion. This can only be done before the `-purger.delete-request-cancel-period` has passed. +`POST /api/v1/admin/tsdb/cancel_delete_request` - To cancel a request if it has not been processed yet for permanent deletion. This can only be done before the `-purger.delete-request-cancel-period` has passed. Parameters: - `request_id` -`GET /api/v1/admin/tsdb/delete_series` - Get all delete requests id’s and their current status. +`GET /api/v1/admin/tsdb/delete_series` - Get all delete requests id’s and their current status. -Prometheus also implements a [clean_tombstones](https://prometheus.io/docs/prometheus/latest/querying/api/#clean-tombstones) API which is not included in this proposal. The tombstones will be deleted automatically once the permanent deletion has taken place which is described in the section below. By default, this should take approximately 24 hours. +Prometheus also implements a [clean_tombstones](https://prometheus.io/docs/prometheus/latest/querying/api/#clean-tombstones) API which is not included in this proposal. The tombstones will be deleted automatically once the permanent deletion has taken place which is described in the section below. By default, this should take approximately 24 hours. ### Deletion Lifecycle The deletion request lifecycle can follow these 3 states: -1. Pending - Tombstone file is created, just doing query time filtering. No data has been deleted yet. +1. Pending - Tombstone file is created, just doing query time filtering. No data has been deleted yet. 2. Deleting - Running delete operations and still doing query time filtering. -4. Processed - All requested data deleted, chunks cache should contain new blocks. Initially, will still need to do query time filtering while waiting for the bucket index and store-gateway to pick up the new blocks and replace the old chunks cache. Once that period has passed, will no longer require any query time filtering. +4. Processed - All requested data deleted, chunks cache should contain new blocks. Initially, will still need to do query time filtering while waiting for the bucket index and store-gateway to pick up the new blocks and replace the old chunks cache. Once that period has passed, will no longer require any query time filtering. -The amount of time for the request to move from `Pending` to `Deleting` is dependent on a config option: `-purger.delete-request-cancel-period`. The purpose of this is to allow the user some time to cancel the deletion request if it was made by mistake. +The amount of time for the request to move from `Pending` to `Deleting` is dependent on a config option: `-purger.delete-request-cancel-period`. The purpose of this is to allow the user some time to cancel the deletion request if it was made by mistake. ### Filtering data during queries while not yet deleted: -Once a deletion request is received, a tombstone entry will be created. The object store such as S3, GCS, Azure storage, can be used to store all the deletion requests. See the section below for more detail on how the tombstones will be stored. 
Using the tombstones, the querier will be able to filter the to-be-deleted data initially. If a cancel delete request is made, then the tombstone file will be deleted. In addition, the existing cache will be invalidated using cache generation numbers, which are described in the later sections. +Once a deletion request is received, a tombstone entry will be created. The object store such as S3, GCS, Azure storage, can be used to store all the deletion requests. See the section below for more detail on how the tombstones will be stored. Using the tombstones, the querier will be able to filter the to-be-deleted data initially. If a cancel delete request is made, then the tombstone file will be deleted. In addition, the existing cache will be invalidated using cache generation numbers, which are described in the later sections. The compactor will scan for new tombstone files and will update the bucket-index with the tombstone information regarding the deletion requests. This will enable the querier to periodically check the bucket index if there are any new tombstone files that are required to be used for filtering. One drawback of this approach is the time it could take to start filtering the data. The compactor will update the bucket index with the new tombstones every `-compactor.cleanup-interval` (default 15 min), and the cached bucket index is refreshed in the querier every `-blocks-storage.bucket-store.sync-interval` (default 15 min), so it could potentially take almost 30 min for queriers to start filtering deleted data when using the default values. If the information requested for deletion is confidential/classified, the time delay is something that the user should be aware of, in addition to the time that the data has already been in Cortex. -An additional thing to consider is that this would mean that the bucket-index would have to be enabled for this API to work. Since the plan is to make to the bucket-index mandatory in the future for block storage, this shouldn't be an issue. +An additional thing to consider is that this would mean that the bucket-index would have to be enabled for this API to work. Since the plan is to make the bucket-index mandatory in the future for block storage, this shouldn't be an issue. Similar to the chunk storage deletion implementation, the initial filtering of the deleted data will be done inside the Querier. This will allow filtering the data read from both the store gateway and the ingester. This functionality already exists for the chunk storage implementation. By implementing it in the querier, this would mean that the ruler will be supported too (ruler internally runs the querier). -#### Storing tombstones in object store +#### Storing tombstones in object store -The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending, deleting, processed`.
Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. +The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending, deleting, processed`. Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. -Updating the states will be done from the compactor. Inside the compactor, the new _DeletedSeriesCleaner_ service will periodically check all the tombstones to see if their current state is ready to be upgraded. If it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information. The information inside the file will be the same except the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When writing to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. +Updating the states will be done from the compactor. Inside the compactor, the new _DeletedSeriesCleaner_ service will periodically check all the tombstones to see if their current state is ready to be upgraded. If it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information. The information inside the file will be the same except the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. 
When writing to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. -The tombstone will be stored in a single JSON file per request and state: +The tombstone will be stored in a single JSON file per request and state: - `//tombstones/.json.` @@ -117,7 +117,7 @@ The schema of the JSON file is: "creationTime": , "matchers": [ "", - .., + .., "" ] }, @@ -127,48 +127,48 @@ The schema of the JSON file is: Pros: -- Allows deletion and un-delete to be done in a single operation. +- Allows deletion and un-delete to be done in a single operation. Cons: - Negative impact on query performance when there are active tombstones. As in the chunk storage implementation, all the series will have to be compared to the matchers contained in the active tombstone files. The impact on performance should be the same as the deletion would have with chunk storage. -- With the default config, potential 30 minute wait for the data to begin filtering if using the default configuration. +- With the default config, potential 30 minute wait for the data to begin filtering if using the default configuration. #### Invalidating cache Using block store, the different caches available are: - Index cache - Metadata cache -- Chunks cache (stores the potentially to be deleted chunks of data) -- Query results cache (stores the potentially to be deleted data) +- Chunks cache (stores the potentially to be deleted chunks of data) +- Query results cache (stores the potentially to be deleted data) -There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. Using the tombstones, the queriers filter out the data received from the ingesters and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. +There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. Using the tombstones, the queriers filter out the data received from the ingesters and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. Firstly, the query results cache needs to be invalidated for each new delete request or a cancel delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. With chunk store, the cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to save a per tenant key using the KV-store with the ring backend and propagate it using a Compare-And-Set/Swap (CAS) operation. Inside the query frontend, if the current cache generation number is older than the one in KV-store or it is empty, then the cache is invalidated and the cache generation number is updated. 
However, since the tombstones are uploaded asynchronously to the queriers, the results from the queriers might contain deleted data after the cache has been already invalidated. That would mean that the results can't be cached until there is a guarantee that the querier's responses have been processed using the most up to date tombstones. Here is the proposed method for ensuring this: -- When a deletion request is made or cancelled, the cache generation number is incremented for the given tenant in the KV-store to the current timestamp. +- When a deletion request is made or cancelled, the cache generation number is incremented for the given tenant in the KV-store to the current timestamp. - Inside the query frontend, the cache generation number will be outdated, so the cache is invalidated. The query frontend will now store the most recent cache generation number. -- When the compactor writes the tombstones to the bucket index, it will include the timestamp of when the write occurred. When the querier reads from the bucket index, it will store this timestamp. -- Inside the query frontend, if a response is not found in cache, it will ask at least one querier. Inside the response of each querier, it will include the timestamp of when the compactor wrote to the bucket index. The query frontend will compare the minimum timestamp that was returned by the queriers to the cache generation number (timestamp). If the minimum timestamp is larger than the cache generation number, that means that all the queriers loaded the bucket index after the compactor has updated it with the most up-to-date tombstones. If this is the case, then the response from each of the queriers should have the correct data and the results can be cached inside the query frontend. Otherwise, the response of the queriers might contain data that was not processed using the most up-to-date tombstones, and the results should not be cached. Using the minimum timestamp returned by the queriers, we can guarantee that the queriers have the most recent tombstones. +- When the compactor writes the tombstones to the bucket index, it will include the timestamp of when the write occurred. When the querier reads from the bucket index, it will store this timestamp. +- Inside the query frontend, if a response is not found in cache, it will ask at least one querier. Inside the response of each querier, it will include the timestamp of when the compactor wrote to the bucket index. The query frontend will compare the minimum timestamp that was returned by the queriers to the cache generation number (timestamp). If the minimum timestamp is larger than the cache generation number, that means that all the queriers loaded the bucket index after the compactor has updated it with the most up-to-date tombstones. If this is the case, then the response from each of the queriers should have the correct data and the results can be cached inside the query frontend. Otherwise, the response of the queriers might contain data that was not processed using the most up-to-date tombstones, and the results should not be cached. Using the minimum timestamp returned by the queriers, we can guarantee that the queriers have the most recent tombstones. -Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. 
However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we keep performing query time filtering using the tombstones once all the data has been deleted for an additional period of time. The filtering would be required until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `processed` state will begin as soon as all the requested data has been permanently deleted from the block store. Once in this state, the query time filtering will last for a length of `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. Then the tombstone will no longer be used for query time filtering. +Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we keep performing query time filtering using the tombstones once all the data has been deleted for an additional period of time. The filtering would be required until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `processed` state will begin as soon as all the requested data has been permanently deleted from the block store. Once in this state, the query time filtering will last for a length of `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. Then the tombstone will no longer be used for query time filtering. ### Permanently deleting the data -The proposed approach is to perform the deletions from the compactor. A new background service inside the compactor called _DeletedSeriesCleaner_ can be created and is responsible for executing the deletion. +The proposed approach is to perform the deletions from the compactor. A new background service inside the compactor called _DeletedSeriesCleaner_ can be created and is responsible for executing the deletion. #### Processing -This will happen after a grace period has passed once the API request has been made. By default this should be 24 hours. The state of the request becomes `Deleting`. A background task can be created to process the permanent deletion of time series. This background task can be executed each hour. +This will happen after a grace period has passed once the API request has been made. By default this should be 24 hours. The state of the request becomes `Deleting`. A background task can be created to process the permanent deletion of time series. This background task can be executed each hour. To delete the data from the blocks, the same logic as the [Bucket Rewrite Tool](https://thanos.io/tip/components/tools.md/#bucket-rewrite ) from Thanos can be leveraged. This tool does the following: `tools bucket rewrite rewrites chosen blocks in the bucket, while deleting or modifying series`. The tool itself is a CLI tool that we won’t be using, but instead we can utilize the logic inside it. 
For more information about the way this tool runs, please see the code [here](https://github.com/thanos-io/thanos/blob/d8b21e708bee6d19f46ca32b158b0509ca9b7fed/cmd/thanos/tools_bucket.go#L809). -The compactor’s _DeletedSeriesCleaner_ will apply this logic on individual blocks and each time it is run, it creates a new block without the data that matched the deletion request. The original individual blocks containing the data that was requested to be deleted, need to be marked for deletion by the compactor. +The compactor’s _DeletedSeriesCleaner_ will apply this logic on individual blocks and each time it is run, it creates a new block without the data that matched the deletion request. The original individual blocks containing the data that was requested to be deleted, need to be marked for deletion by the compactor. -While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request id's that were used to create this block. Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request id(s) used for the rewrite operation inside this field. This will let the _DeletedSeriesCleaner_ know that this block has already processed the particular deletions requests listed in this field. Assuming that the deletion requests are quite rare, the size of the meta.json files should remain small. +While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request id's that were used to create this block. Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request id(s) used for the rewrite operation inside this field. This will let the _DeletedSeriesCleaner_ know that this block has already processed the particular deletions requests listed in this field. Assuming that the deletion requests are quite rare, the size of the meta.json files should remain small. The _DeletedSeriesCleaner_ can iterate through all the blocks that the deletion request could apply to. For each of these blocks, if the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the _DeletedSeriesCleaner_ will process both at the same time to prevent additional blocks from being created. If after iterating through all the blocks, it doesn’t find any such blocks requiring deletion, then the `Deleting` state is complete and the request progresses to the `Processed` state. @@ -180,72 +180,72 @@ Pros: - Simpler implementation as everything is contained within the DeletedSeriesCleaner. Cons: -- Might have to wait for a longer period of time for the compaction to be finished. - - This would mean the earliest time to be able to run the deletion would be once the last time from the block_ranges in the [compactor_config](https://cortexmetrics.io/docs/blocks-storage/compactor/#compactor-configuration) has passed. By default this value is 24 hours, so only once 24 hours have passed and the new compacted blocks have been created, then the rewrite can be safely run. 
+- Might have to wait for a longer period of time for the compaction to be finished. + - This would mean the earliest time to be able to run the deletion would be once the last time from the block_ranges in the [compactor_config](https://cortexmetrics.io/docs/blocks-storage/compactor/#compactor-configuration) has passed. By default this value is 24 hours, so only once 24 hours have passed and the new compacted blocks have been created, then the rewrite can be safely run. -Option 2: For blocks that still need to be compacted further after the deletion request cancel period is over, the deletion logic can be applied before the blocks are compacted. This will generate a new block which can then be used instead for compaction with other blocks. +Option 2: For blocks that still need to be compacted further after the deletion request cancel period is over, the deletion logic can be applied before the blocks are compacted. This will generate a new block which can then be used instead for compaction with other blocks. Pros: -- The deletion can be applied earlier than the previous options. - - Only applies if the deletion request cancel period is less than the last time interval for compaction is. +- The deletion can be applied earlier than the previous options. + - Only applies if the deletion request cancel period is less than the last time interval for compaction is. Cons: -- Added coupling between the compaction and the DeletedSeriesCleaner. -- Might block compaction for a short time while doing the deletion. +- Added coupling between the compaction and the DeletedSeriesCleaner. +- Might block compaction for a short time while doing the deletion. -Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to `Processed`. Once in this state, the queriers will still have to perform query time filtering using the tombstones for a period of time. This period of time is described in the cache invalidation section. Once the time period is over, the tombstone will no longer be used. +Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to `Processed`. Once in this state, the queriers will still have to perform query time filtering using the tombstones for a period of time. This period of time is described in the cache invalidation section. Once the time period is over, the tombstone will no longer be used. #### Handling failed/unfinished delete jobs: -Deletions will be completed and the tombstones will be deleted only when the DeletedSeriesCleaner iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep iterating over the blocks and process the blocks that haven't been rewritten according to the information in the `meta.json` file. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. If the block rewrite was not completed on a particular block, then the original block will not be marked for deletion. The compactor will continue to iterate over the blocks and process the block again. +Deletions will be completed and the tombstones will be deleted only when the DeletedSeriesCleaner iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. 
Otherwise, it will keep iterating over the blocks and process the blocks that haven't been rewritten according to the information in the `meta.json` file. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted. If the block rewrite was not completed on a particular block, then the original block will not be marked for deletion. The compactor will continue to iterate over the blocks and process the block again. #### Tenant Deletion API -If a request is made to delete a tenant, then all the tombstones will be deleted for that user. +If a request is made to delete a tenant, then all the tombstones will be deleted for that user. ## Current Open Questions: -- If the start and end time is very far apart, it might result in a lot of the data being re-written. Since we create a new block without the deleted data and mark the old one for deletion, there may be a period of time with lots of extra blocks and space used for large deletion queries. -- There will be a delay between the deletion request and the deleted data being filtered during queries. - - In Prometheus, there is no delay. - - One way to filter out immediately is to load the tombstones during query time but this will cause a negative performance impact. +- If the start and end time is very far apart, it might result in a lot of the data being re-written. Since we create a new block without the deleted data and mark the old one for deletion, there may be a period of time with lots of extra blocks and space used for large deletion queries. +- There will be a delay between the deletion request and the deleted data being filtered during queries. + - In Prometheus, there is no delay. + - One way to filter out immediately is to load the tombstones during query time but this will cause a negative performance impact. - Adding limits to the API such as: - The number of deletion requests per day, - Number of requests allowed at a time - - How wide apart the start and end time can be. + - How wide apart the start and end time can be. ## Alternatives Considered #### Adding a Pre-processing State -The process of permanently deleting the data can be separated into 2 stages, preprocessing and processing. +The process of permanently deleting the data can be separated into 2 stages, preprocessing and processing. Pre-processing will begin after the `-purger.delete-request-cancel-period` has passed since the API request has been made. The deletion request will move to a new state called `BuildingPlan`. The compactor will outline all the blocks that may contain data to be deleted. For each separate block that the deletion may be applicable to, the compactor will begin the process by adding a series deletion marker inside the series-deletion-marker.json file. The JSON file will contain an array of deletion request id's that need to be applied to the block, which allows the ability to handle the situation when there are multiple tombstones that could be applicable to a particular block. Then during the processing step, instead of checking the meta.json file, we only need to check if a marker file exists with a specific deletion request id. If the marker file exists, then we apply the rewrite logic. #### Alternative Permanent Deletion Processing -For processing the actual deletions, an alternative approach is not to wait until the final compaction has been completed and filter out the data during compaction. 
If the data is marked to be deleted, then don’t include it the new bigger block during compaction. For the remaining blocks where the data wasn’t filtered during compaction, the deletion can be done the same as in the previous section. +For processing the actual deletions, an alternative approach is not to wait until the final compaction has been completed and filter out the data during compaction. If the data is marked to be deleted, then don’t include it in the new bigger block during compaction. For the remaining blocks where the data wasn’t filtered during compaction, the deletion can be done the same as in the previous section. -Pros: +Pros: -- The deletion can happen sooner. -- The rewrite tools creates additional blocks. By filtering the metrics during compaction, the intermediary re-written block will be avoided. +- The deletion can happen sooner. +- The rewrite tool creates additional blocks. By filtering the metrics during compaction, the intermediary re-written block will be avoided. -Cons: +Cons: - A more complicated implementation requiring adding more logic to the compactor - Slower compaction if it needs to filter all the data -- Need to manage which blocks should be deleted with the rewrite vs which blocks already had data filtered during compaction. +- Need to manage which blocks should be deleted with the rewrite vs which blocks already had data filtered during compaction. - Would need to run the rewrite logic during and outside of compaction because some blocks that might need to be deleted are already in the final compaction state. So that would mean the deletion functionality has to be implemented in multiple places. - Won’t be leveraging the rewrite tool from Thanos for all the deletion, so potentially more work is duplicated From f95b002e56403f6b7db8918e6606c649cfb4f3ff Mon Sep 17 00:00:00 2001 From: ilangofman Date: Fri, 2 Jul 2021 01:10:28 -0400 Subject: [PATCH 10/14] Remove the deleting state and change cache invalidation Signed-off-by: ilangofman --- .../block-storage-time-series-deletion.md | 36 +++++++++---------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index fcabe043205..fbfdf352d72 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -75,20 +75,19 @@ Prometheus also implements a [clean_tombstones](https://prometheus.io/docs/prome ### Deletion Lifecycle -The deletion request lifecycle can follow these 3 states: +The deletion request lifecycle can follow these 2 states: + +1. Pending - Tombstone file is created. During this state, the queriers will be performing query time filtering. The initial time period configured by `-purger.delete-request-cancel-period`, no data will be deleted. This will allow the user some time to cancel the deletion request if it was made by mistake. Once this period is over, permanent deletion processing will begin and the request is no longer cancellable. +2. Processed - All requested data has been deleted. Initially, will still need to do query time filtering while waiting for the bucket index and store-gateway to pick up the new blocks. Once that period has passed, will no longer require any query time filtering. -1. Pending - Tombstone file is created, just doing query time filtering. No data has been deleted yet. 2. Deleting - Running delete operations and still doing query time filtering. -4.
Processed - All requested data deleted, chunks cache should contain new blocks. Initially, will still need to do query time filtering while waiting for the bucket index and store-gateway to pick up the new blocks and replace the old chunks cache. Once that period has passed, will no longer require any query time filtering. -The amount of time for the request to move from `Pending` to `Deleting` is dependent on a config option: `-purger.delete-request-cancel-period`. The purpose of this is to allow the user some time to cancel the deletion request if it was made by mistake. ### Filtering data during queries while not yet deleted: Once a deletion request is received, a tombstone entry will be created. The object store such as S3, GCS, Azure storage, can be used to store all the deletion requests. See the section below for more detail on how the tombstones will be stored. Using the tombstones, the querier will be able to filter the to-be-deleted data initially. If a cancel delete request is made, then the tombstone file will be deleted. In addition, the existing cache will be invalidated using cache generation numbers, which are described in the later sections. -The compactor will scan for new tombstone files and will update the bucket-index with the tombstone information regarding the deletion requests. This will enable the querier to periodically check the bucket index if there are any new tombstone files that are required to be used for filtering. One drawback of this approach is the time it could take to start filtering the data. Since the compactor will update the bucket index with the new tombstones every `-compactor.cleanup-interval` (default 15 min). Then the cached bucket index is refreshed in the querier every `-blocks-storage.bucket-store.sync-interval` (default 15 min). Potentially could take almost 30 min for queriers to start filtering deleted data when using the default values. If the information requested for deletion is confidential/classified, the time delay is something that the user should be aware of, in addition to the time that the data has already been in Cortex. +The compactor's _BlocksCleaner_ service will scan for new tombstone files and will update the bucket-index with the tombstone information regarding the deletion requests. This will enable the querier to periodically check the bucket index if there are any new tombstone files that are required to be used for filtering. One drawback of this approach is the time it could take to start filtering the data. Since the compactor will update the bucket index with the new tombstones every `-compactor.cleanup-interval` (default 15 min). Then the cached bucket index is refreshed in the querier every `-blocks-storage.bucket-store.sync-interval` (default 15 min). Potentially could take almost 30 min for queriers to start filtering deleted data when using the default values. If the information requested for deletion is confidential/classified, the time delay is something that the user should be aware of, in addition to the time that the data has already been in Cortex. An additional thing to consider is that this would mean that the bucket-index would have to be enabled for this API to work. Since the plan is to make to the bucket-index mandatory in the future for block storage, this shouldn't be an issue. 
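To make the query-time filtering concrete, here is a rough Go sketch of the check a querier could apply to each series and sample it reads from the ingesters and store-gateways while a tombstone is active. All names (`Tombstone`, `shouldFilterSample`) and the simplified regex-based matchers are illustrative assumptions, not existing Cortex types; a real implementation would reuse the existing PromQL matcher machinery and series iterators.

```go
package main

import (
	"fmt"
	"regexp"
)

// Tombstone is a simplified, hypothetical view of a deletion request as the
// querier would see it after loading it from the bucket index.
type Tombstone struct {
	RequestID string
	StartMs   int64                     // deletion start time (Unix ms), inclusive
	EndMs     int64                     // deletion end time (Unix ms), inclusive
	Matchers  map[string]*regexp.Regexp // label name -> value matcher
}

// matchesSeries reports whether every matcher in the tombstone matches the
// given series labels. A series must match all matchers to be affected.
func (t *Tombstone) matchesSeries(series map[string]string) bool {
	for name, re := range t.Matchers {
		if !re.MatchString(series[name]) {
			return false
		}
	}
	return true
}

// shouldFilterSample reports whether a single sample of a series should be
// hidden from query results because an active tombstone covers it.
func shouldFilterSample(tombstones []*Tombstone, series map[string]string, tsMs int64) bool {
	for _, t := range tombstones {
		if tsMs >= t.StartMs && tsMs <= t.EndMs && t.matchesSeries(series) {
			return true
		}
	}
	return false
}

func main() {
	tombstones := []*Tombstone{{
		RequestID: "example-request",
		StartMs:   1_000,
		EndMs:     2_000,
		Matchers:  map[string]*regexp.Regexp{"__name__": regexp.MustCompile(`^http_requests_total$`)},
	}}
	series := map[string]string{"__name__": "http_requests_total", "job": "api"}
	fmt.Println(shouldFilterSample(tombstones, series, 1_500)) // true: covered by the tombstone
	fmt.Println(shouldFilterSample(tombstones, series, 3_000)) // false: outside the deleted time range
}
```

Because the check runs in the querier, it applies equally to data returned by the store-gateway and to recent samples still held by the ingesters, which is also why the ruler is covered without extra work.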
@@ -97,9 +96,10 @@ Similar to the chunk storage deletion implementation, the initial filtering of t #### Storing tombstones in object store -The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending, deleting, processed`. Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. +The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 2 different file extensions will be `pending, processed`. Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. + -Updating the states will be done from the compactor. Inside the compactor, the new _DeletedSeriesCleaner_ service will periodically check all the tombstones to see if their current state is ready to be upgraded. If it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information. The information inside the file will be the same except the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When writing to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. + When it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information to the object store. 
The information inside the file will be the same except the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When `BlocksCleaner` writes the tombstones to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. Since there are only two states, there could be a scenario where there are 2 files with the same request ID but the extensions: `.pending` and `.processed`. In this case, the `.processed` file will be selected as it is always the later state. The tombstone will be stored in a single JSON file per request and state: @@ -145,14 +145,15 @@ Using block store, the different caches available are: There are two potential caches that could contain deleted data, the chunks cache, and the query results cache. Using the tombstones, the queriers filter out the data received from the ingesters and store-gateway. The cache not being processed through the querier needs to be invalidated to prevent deleted data from coming up in queries. -Firstly, the query results cache needs to be invalidated for each new delete request or a cancel delete request. This can be done using the same mechanism currently used for chunk storage by utilizing the cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. This is already implemented into the middleware and would be easy to use for invalidating the cache. When the cache needs to be invalidated due to a delete or cancel delete request, the cache generation numbers would be increased (to the current timestamp), which would invalidate all the cache entries for a given tenant. With chunk store, the cache generation numbers are currently being stored in an Index table (e.g. DynamoDB or Bigtable). One option for block store is to save a per tenant key using the KV-store with the ring backend and propagate it using a Compare-And-Set/Swap (CAS) operation. Inside the query frontend, if the current cache generation number is older than the one in KV-store or it is empty, then the cache is invalidated and the cache generation number is updated. However, since the tombstones are uploaded asynchronously to the queriers, the results from the queriers might contain deleted data after the cache has been already invalidated. That would mean that the results can't be cached until there is a guarantee that the querier's responses have been processed using the most up to date tombstones. Here is the proposed method for ensuring this: +Firstly, the query results cache needs to be invalidated for each new delete request or a cancellation of one. This can be accomplished by utilizing cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. When the query front-end discovers a cache generation number that is greater than the previous generation number, then it knows to invalidate the query results cache. 
However, the cache can only be invalidated once the queriers have loaded the tombstones from the bucket index and have begun filtering the data. Otherwise, to-be deleted data might show up in queries and be cached again. One of the way to guarantee that all the queriers are using the new tombstones is to wait until the bucket index staleness period has passed from the time the tombstones have been written to the bucket index. The staleness period can be configured using the following flag: `-blocks-storage.bucket-store.bucket-index.max-stale-period`. We can use the bucket index staleness period as the delay to wait before the cache generation number is increased. A query will fail inside the querier, if the bucket index last update is older the staleness period. Once this period is over, all the queriers should have the updated tombstones and the query results cache can be invalidated. Here is the proposed method for accomplishing this: -- When a deletion request is made or cancelled, the cache generation number is incremented for the given tenant in the KV-store to the current timestamp. -- Inside the query frontend, the cache generation number will be outdated, so the cache is invalidated. The query frontend will now store the most recent cache generation number. -- When the compactor writes the tombstones to the bucket index, it will include the timestamp of when the write occurred. When the querier reads from the bucket index, it will store this timestamp. -- Inside the query frontend, if a response is not found in cache, it will ask at least one querier. Inside the response of each querier, it will include the timestamp of when the compactor wrote to the bucket index. The query frontend will compare the minimum timestamp that was returned by the queriers to the cache generation number (timestamp). If the minimum timestamp is larger than the cache generation number, that means that all the queriers loaded the bucket index after the compactor has updated it with the most up-to-date tombstones. If this is the case, then the response from each of the queriers should have the correct data and the results can be cached inside the query frontend. Otherwise, the response of the queriers might contain data that was not processed using the most up-to-date tombstones, and the results should not be cached. Using the minimum timestamp returned by the queriers, we can guarantee that the queriers have the most recent tombstones. -Furthermore, since the chunks cache is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. However, some issues may arise if the tombstone is deleted but the data to-be-deleted still exists in the chunks cache. To prevent this, we keep performing query time filtering using the tombstones once all the data has been deleted for an additional period of time. The filtering would be required until the store-gateway picks up the new blocks and the chunks cache is able to be refreshed with the new blocks without the deleted data. The `processed` state will begin as soon as all the requested data has been permanently deleted from the block store. Once in this state, the query time filtering will last for a length of `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval`. Once that time period has passed, the chunks cache should not have any of the deleted data. 
Then the tombstone will no longer be used for query time filtering. +- The cache generation number will be a timestamp. It will also serve as the time of when it becomes valid and the query front-end can use it. +- The bucket index will store the cache generation number. The query front-end will periodically fetch the bucket index. +- Inside the compactor, it will load the tombstones from object store and update the bucket index accordingly. If a deletion request is made or cancelled, the compactor will discover this and increment the cache generation number in the bucket index. The cache generation number will be the current timestamp + the max stale period of the bucket index. The compactor can discover if there have been any changes to the tombstones by comparing the newly loaded tombstones to the one's currently in the bucket index. +- The query front-end will fetch the cache generation number from the bucket index. If the current timestamp is less than the cache generation number, it will simply not do anything as the generation number is not yet valid. If the query front-end discovers that the current time has passed the cache generation timestamp from the bucket index, then it is valid and can be used. The query front end will compare it to the current cache generation number stored in the front-end. If the cache generation number from the front-end is less than the one from bucket index, then the cache is invalidated. + +In regards to the chunks cache, since it is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section. ### Permanently deleting the data @@ -161,7 +162,7 @@ The proposed approach is to perform the deletions from the compactor. A new back #### Processing -This will happen after a grace period has passed once the API request has been made. By default this should be 24 hours. The state of the request becomes `Deleting`. A background task can be created to process the permanent deletion of time series. This background task can be executed each hour. +This will happen after a grace period has passed once the API request has been made. By default this should be 24 hours. A background task can be created to process the permanent deletion of time series. This background task can be executed each hour. To delete the data from the blocks, the same logic as the [Bucket Rewrite Tool](https://thanos.io/tip/components/tools.md/#bucket-rewrite ) from Thanos can be leveraged. This tool does the following: `tools bucket rewrite rewrites chosen blocks in the bucket, while deleting or modifying series`. The tool itself is a CLI tool that we won’t be using, but instead we can utilize the logic inside it. For more information about the way this tool runs, please see the code [here](https://github.com/thanos-io/thanos/blob/d8b21e708bee6d19f46ca32b158b0509ca9b7fed/cmd/thanos/tools_bucket.go#L809). @@ -170,7 +171,7 @@ The compactor’s _DeletedSeriesCleaner_ will apply this logic on individual blo While deleting the data permanently from the block storage, the `meta.json` files will be used to keep track of the deletion progress. Inside each `meta.json` file, we will add a new field called `tombstonesFiltered`. This will store an array of deletion request id's that were used to create this block. 
Once the rewrite logic is applied to a block, the new block's `meta.json` file will append the deletion request id(s) used for the rewrite operation inside this field. This will let the _DeletedSeriesCleaner_ know that this block has already processed the particular deletion requests listed in this field. Assuming that the deletion requests are quite rare, the size of the meta.json files should remain small. -The _DeletedSeriesCleaner_ can iterate through all the blocks that the deletion request could apply to. For each of these blocks, if the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones in the `Deleting` state that apply to a particular block, then the _DeletedSeriesCleaner_ will process both at the same time to prevent additional blocks from being created. If after iterating through all the blocks, it doesn’t find any such blocks requiring deletion, then the `Deleting` state is complete and the request progresses to the `Processed` state. +The _DeletedSeriesCleaner_ can iterate through all the blocks that the deletion request could apply to. For each of these blocks, if the deletion request ID isn't inside the meta.json `tombstonesFiltered` field, then the compactor can apply the rewrite logic to this block. If there are multiple tombstones that are currently being processed for deletion and apply to a particular block, then the _DeletedSeriesCleaner_ will process all of them at the same time to prevent additional blocks from being created. If after iterating through all the blocks, it doesn’t find any such blocks requiring deletion, then the `Pending` state is complete and the request progresses to the `Processed` state. One important thing to note regarding this rewrite tool is that it should not be used at the same time as when another compactor is touching a block. If the tool is run at the same time as compaction on a particular block, it can cause overlap and the data marked for deletion can already be part of the compacted block. To mitigate such issues, these are some of the proposed solutions: @@ -197,8 +198,7 @@ Cons: -Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to `Processed`. Once in this state, the queriers will still have to perform query time filtering using the tombstones for a period of time. This period of time is described in the cache invalidation section. Once the time period is over, the tombstone will no longer be used. - +Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to the `Processed` state. Once in this state, the queriers will still have to perform query time filtering using the tombstones until the old blocks that were marked for deletion are no longer queried by the queriers. This will mean that the query time filtering will last for an additional length of `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval` in the `Processed` state. Once that time period has passed, the queriers should no longer be querying any of the old blocks that were marked for deletion. The tombstone will no longer be used after this.
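As a rough illustration of the `meta.json` bookkeeping described above, the sketch below shows how the proposed `tombstonesFiltered` field could drive the per-block rewrite decision. `BlockMeta`, `DeletionRequest` and `needsRewrite` are hypothetical, trimmed-down stand-ins for the real Thanos/Cortex metadata types, shown only to make the decision rule explicit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlockMeta is a hypothetical, trimmed-down stand-in for a block's meta.json,
// extended with the proposed tombstonesFiltered field. Only the fields needed
// for this example are shown.
type BlockMeta struct {
	ULID               string   `json:"ulid"`
	MinTime            int64    `json:"minTime"`
	MaxTime            int64    `json:"maxTime"`
	TombstonesFiltered []string `json:"tombstonesFiltered,omitempty"`
}

// DeletionRequest carries the tombstone information relevant to the rewrite decision.
type DeletionRequest struct {
	RequestID string
	StartMs   int64
	EndMs     int64
}

// needsRewrite reports whether the block still has to be rewritten for the
// given deletion request: the request overlaps the block's time range and the
// block's meta.json does not list the request ID yet.
func needsRewrite(meta BlockMeta, req DeletionRequest) bool {
	if req.EndMs < meta.MinTime || req.StartMs > meta.MaxTime {
		return false // no overlap, nothing to delete in this block
	}
	for _, id := range meta.TombstonesFiltered {
		if id == req.RequestID {
			return false // already rewritten for this request
		}
	}
	return true
}

func main() {
	meta := BlockMeta{ULID: "01EXAMPLEULID000000000000", MinTime: 0, MaxTime: 10_000, TombstonesFiltered: []string{"req-1"}}
	fmt.Println(needsRewrite(meta, DeletionRequest{RequestID: "req-1", StartMs: 100, EndMs: 200})) // false
	fmt.Println(needsRewrite(meta, DeletionRequest{RequestID: "req-2", StartMs: 100, EndMs: 200})) // true

	// After a rewrite, the new block's meta.json would carry the request ID forward.
	meta.TombstonesFiltered = append(meta.TombstonesFiltered, "req-2")
	out, _ := json.MarshalIndent(meta, "", "  ")
	fmt.Println(string(out))
}
```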
#### Handling failed/unfinished delete jobs: From add4da27a10083ed3478b0d2e0b991fb80c11367 Mon Sep 17 00:00:00 2001 From: ilangofman Date: Thu, 8 Jul 2021 10:12:27 -0400 Subject: [PATCH 11/14] Add deleted state and update cache invalidation Signed-off-by: ilangofman --- .../block-storage-time-series-deletion.md | 26 +++++++++++-------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md index fbfdf352d72..999779a721d 100644 --- a/docs/proposals/block-storage-time-series-deletion.md +++ b/docs/proposals/block-storage-time-series-deletion.md @@ -75,11 +75,11 @@ Prometheus also implements a [clean_tombstones](https://prometheus.io/docs/prome ### Deletion Lifecycle -The deletion request lifecycle can follow these 2 states: +The deletion request lifecycle can follow these 3 states: -1. Pending - Tombstone file is created. During this state, the queriers will be performing query time filtering. The initial time period configured by `-purger.delete-request-cancel-period`, no data will be deleted. This will allow the user some time to cancel the deletion request if it was made by mistake. Once this period is over, permanent deletion processing will begin and the request is no longer cancellable. +1. Pending - Tombstone file is created. During this state, the queriers will be performing query time filtering. The initial time period configured by `-purger.delete-request-cancel-period`, no data will be deleted. Once this period is over, permanent deletion processing will begin and the request is no longer cancellable. 2. Processed - All requested data has been deleted. Initially, will still need to do query time filtering while waiting for the bucket index and store-gateway to pick up the new blocks. Once that period has passed, will no longer require any query time filtering. - +3. Deleted - The deletion request was cancelled. A grace period configured by `-purger.delete-request-cancel-period` will allow the user some time to cancel the deletion request if it was made by mistake. The request is no longer cancelable after this period has passed. @@ -96,10 +96,9 @@ Similar to the chunk storage deletion implementation, the initial filtering of t #### Storing tombstones in object store -The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 2 different file extensions will be `pending, processed`. Each time the deletion request moves to a new state, a new file will be added with the same content but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. - +The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. 
Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending, processed, deleted`. Each time the deletion request moves to a new state, a new file will be added with the same deletion information but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. If a deletion request is cancelled, then a tombstone file with the `.deleted` filename extension will be created. - When it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information to the object store. The information inside the file will be the same except the `creationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When `BlocksCleaner` writes the tombstones to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. Since there are only two states, there could be a scenario where there are 2 files with the same request ID but the extensions: `.pending` and `.processed`. In this case, the `.processed` file will be selected as it is always the later state. +When it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information to the object store. The information inside the file will be the same except the `stateCreationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When `BlocksCleaner` writes the tombstones to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. There could be a scenario where there are two files with the same request ID but different extensions: {`.pending`, `.processed`} or {`.pending`, `.deleted`}. 
In this case, the `.processed` or `.deleted ` file will be selected as it is always the later state compared to the `pending` state. The tombstone will be stored in a single JSON file per request and state: @@ -114,7 +113,8 @@ The schema of the JSON file is: "requestId": , "startTime": , "endTime": , - "creationTime": , + "requestCreationTime": , + "stateCreationTime": , "matchers": [ "", .., @@ -148,10 +148,11 @@ There are two potential caches that could contain deleted data, the chunks cache Firstly, the query results cache needs to be invalidated for each new delete request or a cancellation of one. This can be accomplished by utilizing cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. When the query front-end discovers a cache generation number that is greater than the previous generation number, then it knows to invalidate the query results cache. However, the cache can only be invalidated once the queriers have loaded the tombstones from the bucket index and have begun filtering the data. Otherwise, to-be deleted data might show up in queries and be cached again. One of the way to guarantee that all the queriers are using the new tombstones is to wait until the bucket index staleness period has passed from the time the tombstones have been written to the bucket index. The staleness period can be configured using the following flag: `-blocks-storage.bucket-store.bucket-index.max-stale-period`. We can use the bucket index staleness period as the delay to wait before the cache generation number is increased. A query will fail inside the querier, if the bucket index last update is older the staleness period. Once this period is over, all the queriers should have the updated tombstones and the query results cache can be invalidated. Here is the proposed method for accomplishing this: -- The cache generation number will be a timestamp. It will also serve as the time of when it becomes valid and the query front-end can use it. +- The cache generation number will be a timestamp. The default value will be 0. - The bucket index will store the cache generation number. The query front-end will periodically fetch the bucket index. -- Inside the compactor, it will load the tombstones from object store and update the bucket index accordingly. If a deletion request is made or cancelled, the compactor will discover this and increment the cache generation number in the bucket index. The cache generation number will be the current timestamp + the max stale period of the bucket index. The compactor can discover if there have been any changes to the tombstones by comparing the newly loaded tombstones to the one's currently in the bucket index. -- The query front-end will fetch the cache generation number from the bucket index. If the current timestamp is less than the cache generation number, it will simply not do anything as the generation number is not yet valid. If the query front-end discovers that the current time has passed the cache generation timestamp from the bucket index, then it is valid and can be used. The query front end will compare it to the current cache generation number stored in the front-end. If the cache generation number from the front-end is less than the one from bucket index, then the cache is invalidated. +- Inside the compactor, the _BlocksCleaner_ will load the tombstones from object store and update the bucket index accordingly. 
+- Inside the compactor, the _BlocksCleaner_ will load the tombstones from object store and update the bucket index accordingly. It will calculate the cache generation number by iterating through all the tombstones and their respective times (next bullet point) and selecting the maximum timestamp that is less than (current time minus `-blocks-storage.bucket-store.bucket-index.max-stale-period`). This would mean that if a deletion request is made or cancelled, the compactor will only update the cache generation number once the staleness period is over, ensuring that all queriers have the updated tombstones.
+- For requests in a pending or processed state, the `requestCreationTime` will be used when comparing the maximum timestamps. If a request is in a deleted state, it will use the `stateCreationTime` for comparing the timestamps. This means that the cache gets invalidated only once it has been created or deleted, and the bucket index staleness period has passed. The cache will not be invalidated when a request advances from pending to processed state.
+- The query front-end will fetch the cache generation number from the bucket index. The query front end will compare it to the current cache generation number stored in the front-end. If the cache generation number from the front-end is less than the one from bucket index, then the cache is invalidated.
 In regards to the chunks cache, since it is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section.
@@ -201,6 +202,10 @@ Cons:
 Once all the applicable blocks have been rewritten without the deleted data, the deletion request state moves to the `Processed` state. Once in this state, the queriers will still have to perform query time filtering using the tombstones until the old blocks that were marked for deletion are no longer queried by the queriers. This will mean that the query time filtering will last for an additional length of `-compactor.deletion-delay + -compactor.cleanup-interval + -blocks-storage.bucket-store.sync-interval` in the `Processed` state. Once that time period has passed, the queriers should no longer be querying any of the old blocks that were marked for deletion. The tombstone will no longer be used after this.
+#### Cancelled Delete Requests
+
+If a request was successfully cancelled, then a tombstone file with a `.deleted` extension is created. This is done to help ensure that the cache generation number is updated and the query results cache is invalidated. The compactor's blocks cleaner can take care of cleaning up `.deleted` tombstones after a period of time when they are no longer required for cache invalidation. This can be done after 10 times the bucket index max staleness time period has passed. Before removing the file from the object store, the current cache generation number must be greater than or equal to when the tombstone was cancelled.
+
 #### Handling failed/unfinished delete jobs:
 Deletions will be completed and the tombstones will be deleted only when the DeletedSeriesCleaner iterates over all blocks that match the time interval and confirms that they have been re-written without the deleted data. Otherwise, it will keep iterating over the blocks and process the blocks that haven't been rewritten according to the information in the `meta.json` file. In case of any failure that causes the deletion to stop, any unfinished deletions will be resumed once the service is restarted.
 If the block rewrite was not completed on a particular block, then the original block will not be marked for deletion. The compactor will continue to iterate over the blocks and process the block again.
@@ -211,7 +216,6 @@ Deletions will be completed and the tombstones will be deleted only when the Del
 If a request is made to delete a tenant, then all the tombstones will be deleted for that user.
-
 ## Current Open Questions:
 - If the start and end time is very far apart, it might result in a lot of the data being re-written. Since we create a new block without the deleted data and mark the old one for deletion, there may be a period of time with lots of extra blocks and space used for large deletion queries.

From 434df4847a2bbb4d3beea2ca9090bc25b4fdda5d Mon Sep 17 00:00:00 2001
From: ilangofman
Date: Thu, 8 Jul 2021 10:17:22 -0400
Subject: [PATCH 12/14] Add one word to clear things up

Signed-off-by: ilangofman
---
 docs/proposals/block-storage-time-series-deletion.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md
index 999779a721d..023d0eb4caa 100644
--- a/docs/proposals/block-storage-time-series-deletion.md
+++ b/docs/proposals/block-storage-time-series-deletion.md
@@ -151,7 +151,7 @@ Firstly, the query results cache needs to be invalidated for each new delete req
 - The cache generation number will be a timestamp. The default value will be 0.
 - The bucket index will store the cache generation number. The query front-end will periodically fetch the bucket index.
 - Inside the compactor, the _BlocksCleaner_ will load the tombstones from object store and update the bucket index accordingly. It will calculate the cache generation number by iterating through all the tombstones and their respective times (next bullet point) and selecting the maximum timestamp that is less than (current time minus `-blocks-storage.bucket-store.bucket-index.max-stale-period`). This would mean that if a deletion request is made or cancelled, the compactor will only update the cache generation number once the staleness period is over, ensuring that all queriers have the updated tombstones.
-- For requests in a pending or processed state, the `requestCreationTime` will be used when comparing the maximum timestamps. If a request is in a deleted state, it will use the `stateCreationTime` for comparing the timestamps. This means that the cache gets invalidated only once it has been created or deleted, and the bucket index staleness period has passed. The cache will not be invalidated when a request advances from pending to processed state.
+- For requests in a pending or processed state, the `requestCreationTime` will be used when comparing the maximum timestamps. If a request is in a deleted state, it will use the `stateCreationTime` for comparing the timestamps. This means that the cache gets invalidated only once it has been created or deleted, and the bucket index staleness period has passed. The cache will not be invalidated again when a request advances from pending to processed state.
 - The query front-end will fetch the cache generation number from the bucket index. The query front end will compare it to the current cache generation number stored in the front-end. If the cache generation number from the front-end is less than the one from bucket index, then the cache is invalidated.
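The bullet points above describe how the _BlocksCleaner_ would derive the cache generation number; a minimal Go sketch of that calculation is shown below. The type and function names are assumptions made for illustration and are not part of the existing Cortex codebase.

```go
package compactor

import "time"

type TombstoneState string

const (
	StatePending   TombstoneState = "pending"
	StateProcessed TombstoneState = "processed"
	StateDeleted   TombstoneState = "deleted"
)

type Tombstone struct {
	RequestCreationTime time.Time
	StateCreationTime   time.Time
	State               TombstoneState
}

// cacheGenNumber returns the maximum relevant tombstone timestamp that is older
// than (now - staleness), so the generation number only advances once every
// querier has had time to reload the bucket index and pick up the new
// tombstones. A zero result is the default "nothing to invalidate yet" value.
func cacheGenNumber(tombstones []Tombstone, now time.Time, staleness time.Duration) int64 {
	cutoff := now.Add(-staleness)
	var gen int64 // default cache generation number is 0

	for _, t := range tombstones {
		// Pending/processed requests are keyed by when the request was created;
		// cancelled (deleted) requests by when the cancellation happened.
		ts := t.RequestCreationTime
		if t.State == StateDeleted {
			ts = t.StateCreationTime
		}
		if ts.Before(cutoff) && ts.Unix() > gen {
			gen = ts.Unix()
		}
	}
	return gen
}
```

On the query front-end side the check then reduces to a simple comparison: if the generation number currently stored in the front-end is lower than the one read from the bucket index, the tenant's query results cache is invalidated.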
 In regards to the chunks cache, since it is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section.

From 98abab0bff199f12db16e15091bf89b971838e3b Mon Sep 17 00:00:00 2001
From: ilangofman
Date: Thu, 8 Jul 2021 10:25:44 -0400
Subject: [PATCH 13/14] update api limits section

Signed-off-by: ilangofman
---
 docs/proposals/block-storage-time-series-deletion.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md
index 023d0eb4caa..326aee99f0c 100644
--- a/docs/proposals/block-storage-time-series-deletion.md
+++ b/docs/proposals/block-storage-time-series-deletion.md
@@ -223,9 +223,9 @@ If a request is made to delete a tenant, then all the tombstones will be deleted
 - In Prometheus, there is no delay.
 - One way to filter out immediately is to load the tombstones during query time but this will cause a negative performance impact.
 - Adding limits to the API such as:
-  - The number of deletion requests per day,
-  - Number of requests allowed at a time
-  - How wide apart the start and end time can be.
+  - Max number of deletion requests allowed in the last 24 hours for a given tenant.
+  - Max number of pending tombstones for a given tenant.
+
 ## Alternatives Considered

From bc19925656952b2e14882f3c8058517515a04e95 Mon Sep 17 00:00:00 2001
From: ilangofman
Date: Fri, 9 Jul 2021 07:56:30 -0400
Subject: [PATCH 14/14] ran clean white noise

Signed-off-by: ilangofman
---
 .../block-storage-time-series-deletion.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/proposals/block-storage-time-series-deletion.md b/docs/proposals/block-storage-time-series-deletion.md
index 326aee99f0c..20d7a346e52 100644
--- a/docs/proposals/block-storage-time-series-deletion.md
+++ b/docs/proposals/block-storage-time-series-deletion.md
@@ -79,7 +79,7 @@ The deletion request lifecycle can follow these 3 states:
 1. Pending - Tombstone file is created. During this state, the queriers will be performing query time filtering. During the initial time period configured by `-purger.delete-request-cancel-period`, no data will be deleted. Once this period is over, permanent deletion processing will begin and the request is no longer cancellable.
 2. Processed - All requested data has been deleted. Initially, will still need to do query time filtering while waiting for the bucket index and store-gateway to pick up the new blocks. Once that period has passed, will no longer require any query time filtering.
-3. Deleted - The deletion request was cancelled. A grace period configured by `-purger.delete-request-cancel-period` will allow the user some time to cancel the deletion request if it was made by mistake. The request is no longer cancelable after this period has passed.
+3. Deleted - The deletion request was cancelled. A grace period configured by `-purger.delete-request-cancel-period` will allow the user some time to cancel the deletion request if it was made by mistake. The request is no longer cancelable after this period has passed.
@@ -96,7 +96,7 @@ Similar to the chunk storage deletion implementation, the initial filtering of t
 #### Storing tombstones in object store
-The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending, processed, deleted`. Each time the deletion request moves to a new state, a new file will be added with the same deletion information but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. If a deletion request is cancelled, then a tombstone file with the `.deleted` filename extension will be created.
+The Purger will write the new tombstone entries in a separate folder called `tombstones` in the object store (e.g. S3 bucket) in the respective tenant folder. Each tombstone can have a separate JSON file outlining all the necessary information about the deletion request such as the parameters passed in the request, as well as some meta-data such as the creation date of the file. The name of the file can be a hash of the API parameters (start, end, markers). This way if a user calls the API twice by accident with the same parameters, it will only create one tombstone. To keep track of the request state, filename extensions can be used. This will allow the tombstone files to be immutable. The 3 different file extensions will be `pending, processed, deleted`. Each time the deletion request moves to a new state, a new file will be added with the same deletion information but a different extension to indicate the new state. The file containing the previous state will be deleted once the new one is created. If a deletion request is cancelled, then a tombstone file with the `.deleted` filename extension will be created.
 When it is determined that the request should move to the next state, then it will first write a new file containing the tombstone information to the object store. The information inside the file will be the same except the `stateCreationTime`, which is replaced with the current timestamp. The extension of the new file will be different to reflect the new state. If the new file is successfully written, the file with the previous state is deleted. If the write of the new file fails, then the previous file is not going to be deleted. Next time the service runs to check the state of each tombstone, it will retry creating the new file with the updated state. If the write is successful but the deletion of the old file is unsuccessful then there will be 2 tombstone files with the same filename but different extension. When `BlocksCleaner` writes the tombstones to the bucket index, the compactor will check for duplicate tombstone files but with different extensions. It will use the tombstone with the most recently updated state and try to delete the file with the older state. There could be a scenario where there are two files with the same request ID but different extensions: {`.pending`, `.processed`} or {`.pending`, `.deleted`}. In this case, the `.processed` or `.deleted` file will be selected as it is always the later state compared to the `pending` state.
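As the paragraphs above note, the tombstone's file name can be a hash of the request parameters so that a duplicate request maps onto the same tombstone. The sketch below shows one possible way to derive such a name in Go; the hash function, separator and matcher normalisation are illustrative assumptions rather than a settled design.

```go
package purger

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// requestHash derives a deterministic tombstone file name from the delete
// request parameters, so an identical request submitted twice maps onto the
// same tombstone file.
func requestHash(startMs, endMs int64, matchers []string) string {
	// Sort matchers so that logically identical requests hash identically,
	// regardless of the order in which the selectors were passed.
	sorted := append([]string(nil), matchers...)
	sort.Strings(sorted)

	h := sha256.New()
	fmt.Fprintf(h, "%d:%d:%s", startMs, endMs, strings.Join(sorted, ","))
	return hex.EncodeToString(h.Sum(nil))
}
```

The resulting hex string would then serve as the request ID, for example stored as `<hash>.pending` under the tenant's `tombstones` folder.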
@@ -148,10 +148,10 @@ There are two potential caches that could contain deleted data, the chunks cache Firstly, the query results cache needs to be invalidated for each new delete request or a cancellation of one. This can be accomplished by utilizing cache generation numbers. For each tenant, their cache is prefixed with a cache generation number. When the query front-end discovers a cache generation number that is greater than the previous generation number, then it knows to invalidate the query results cache. However, the cache can only be invalidated once the queriers have loaded the tombstones from the bucket index and have begun filtering the data. Otherwise, to-be deleted data might show up in queries and be cached again. One of the way to guarantee that all the queriers are using the new tombstones is to wait until the bucket index staleness period has passed from the time the tombstones have been written to the bucket index. The staleness period can be configured using the following flag: `-blocks-storage.bucket-store.bucket-index.max-stale-period`. We can use the bucket index staleness period as the delay to wait before the cache generation number is increased. A query will fail inside the querier, if the bucket index last update is older the staleness period. Once this period is over, all the queriers should have the updated tombstones and the query results cache can be invalidated. Here is the proposed method for accomplishing this: -- The cache generation number will be a timestamp. The default value will be 0. +- The cache generation number will be a timestamp. The default value will be 0. - The bucket index will store the cache generation number. The query front-end will periodically fetch the bucket index. -- Inside the compactor, the _BlocksCleaner_ will load the tombstones from object store and update the bucket index accordingly. It will calculate the cache generation number by iterating through all the tombstones and their respective times (next bullet point) and selecting the maximum timestamp that is less than (current time minus `-blocks-storage.bucket-store.bucket-index.max-stale-period`). This would mean that if a deletion request is made or cancelled, the compactor will only update the cache generation number once the staleness period is over, ensuring that all queriers have the updated tombstones. -- For requests in a pending or processed state, the `requestCreationTime` will be used when comparing the maximum timestamps. If a request is in a deleted state, it will use the `stateCreationTime` for comparing the timestamps. This means that the cache gets invalidated only once it has been created or deleted, and the bucket index staleness period has passed. The cache will not be invalidated again when a request advances from pending to processed state. +- Inside the compactor, the _BlocksCleaner_ will load the tombstones from object store and update the bucket index accordingly. It will calculate the cache generation number by iterating through all the tombstones and their respective times (next bullet point) and selecting the maximum timestamp that is less than (current time minus `-blocks-storage.bucket-store.bucket-index.max-stale-period`). This would mean that if a deletion request is made or cancelled, the compactor will only update the cache generation number once the staleness period is over, ensuring that all queriers have the updated tombstones. 
+- For requests in a pending or processed state, the `requestCreationTime` will be used when comparing the maximum timestamps. If a request is in a deleted state, it will use the `stateCreationTime` for comparing the timestamps. This means that the cache gets invalidated only once it has been created or deleted, and the bucket index staleness period has passed. The cache will not be invalidated again when a request advances from pending to processed state.
 - The query front-end will fetch the cache generation number from the bucket index. The query front end will compare it to the current cache generation number stored in the front-end. If the cache generation number from the front-end is less than the one from bucket index, then the cache is invalidated.
 In regards to the chunks cache, since it is retrieved from the store gateway and passed to the querier, it will be filtered out like the rest of the time series data in the querier using the tombstones, with the mechanism described in the previous section.
@@ -204,7 +204,7 @@ Once all the applicable blocks have been rewritten without the deleted data, the
 #### Cancelled Delete Requests
-If a request was successfully cancelled, then a tombstone file with a `.deleted` extension is created. This is done to help ensure that the cache generation number is updated and the query results cache is invalidated. The compactor's blocks cleaner can take care of cleaning up `.deleted` tombstones after a period of time when they are no longer required for cache invalidation. This can be done after 10 times the bucket index max staleness time period has passed. Before removing the file from the object store, the current cache generation number must be greater than or equal to when the tombstone was cancelled.
+If a request was successfully cancelled, then a tombstone file with a `.deleted` extension is created. This is done to help ensure that the cache generation number is updated and the query results cache is invalidated. The compactor's blocks cleaner can take care of cleaning up `.deleted` tombstones after a period of time when they are no longer required for cache invalidation. This can be done after 10 times the bucket index max staleness time period has passed. Before removing the file from the object store, the current cache generation number must be greater than or equal to when the tombstone was cancelled.
 #### Handling failed/unfinished delete jobs:
@@ -223,7 +223,7 @@ If a request is made to delete a tenant, then all the tombstones will be deleted
 - In Prometheus, there is no delay.
 - One way to filter out immediately is to load the tombstones during query time but this will cause a negative performance impact.
 - Adding limits to the API such as:
-  - Max number of deletion requests allowed in the last 24 hours for a given tenant.
+  - Max number of deletion requests allowed in the last 24 hours for a given tenant.
   - Max number of pending tombstones for a given tenant.
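To illustrate the kind of limits mentioned in the open questions above, here is a rough Go sketch of how the purger might reject a new delete request for a tenant. The types, field names and error messages are hypothetical; nothing here corresponds to an existing Cortex configuration option.

```go
package purger

import (
	"fmt"
	"time"
)

// tombstoneInfo is a minimal stand-in for the tenant's tombstones already in
// the object store; only the fields needed for the limit checks are included.
type tombstoneInfo struct {
	requestCreationTime time.Time
	state               string // "pending", "processed" or "deleted"
}

// deleteRequestLimits holds per-tenant limits; how they would be configured
// (flags, per-tenant overrides) is left open.
type deleteRequestLimits struct {
	maxRequestsPer24h    int
	maxPendingTombstones int
}

// validateNewRequest sketches how the purger could reject a new delete request
// that would exceed the proposed limits.
func validateNewRequest(existing []tombstoneInfo, limits deleteRequestLimits, now time.Time) error {
	var recent, pending int
	for _, t := range existing {
		if now.Sub(t.requestCreationTime) <= 24*time.Hour {
			recent++
		}
		if t.state == "pending" {
			pending++
		}
	}
	if recent >= limits.maxRequestsPer24h {
		return fmt.Errorf("delete request limit reached: %d requests in the last 24 hours", recent)
	}
	if pending >= limits.maxPendingTombstones {
		return fmt.Errorf("pending tombstone limit reached: %d pending requests", pending)
	}
	return nil
}
```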