Skip to content

provider: deduplicate cids in queue #910

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 24, 2025

Conversation

guillaumemichel
Copy link
Contributor

Superseeds #909

We cannot use a lru cache as a drop in replacement for the internal buffer queue. When an item is read from the internal buffer queue, we don't want it out of the lru cache for deduplication. Also we don't want to clear the lru cache every time the queue is persisted to the datastore.

Unfortunately, it is necessary to keep additional state. Note that the lru cache size can be reduced if deemed too large.

Alternatively, it would be more lean not to push duplicate in the queue in the first place, so that we don't any deduplication inside the provider queue. This is tracked in #901.

#907 doesn't need to wait on this PR

@guillaumemichel guillaumemichel requested a review from a team as a code owner April 16, 2025 07:50
@guillaumemichel guillaumemichel changed the title deduplication cache with additional state provider: deduplicate cids in queue Apr 16, 2025
Copy link
Contributor

@gammazero gammazero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Batch processing after removing items from the queue does some amount of deduplication, so I am not sure if this change will make a lot of difference.

We should wait until #901 is fixed, or decide this is the fix for #901.

@guillaumemichel
Copy link
Contributor Author

Batch processing after removing items from the queue does some amount of deduplication

#907 removes batching, so if we merge #907 without solving #901 we need deduplication in the queue directly (#910).

@lidel lidel mentioned this pull request Apr 22, 2025
39 tasks
@guillaumemichel
Copy link
Contributor Author

We discussed about using a 2Q cache instead of the LRU for efficiency. After thinking about it again, I don't think the 2Q will be more efficient since CIDs are reportedly added 3 times to the queue, so we don't benefit from the 1-hit feature of 2Q, and it costs extra memory.

The current solution is expected to be more efficient than previous small batches, and approx. as efficient as previous large batches. Its memory usage is constant (worse than previous small batches, but better than previous large batches).

I suggest we merge this, since it is expected to be a gradual improvement of the current situation. We can always revisit later.

WDYT @gammazero @lidel


Ideally, I think we need a dedicated datastore for managing reprovides. For deduplication, we could either:

  1. check membership in the reprovide datastore
  2. use bloom filter (or similar) generated from cids in datastore
    • size of bloom filter should be known beforehand to limit false positives (false negatives are fine, but we want to avoid false positives). Hence it may be necessary to generate new filters as size increases/once we can estimate the ingestion rate.

This is out of the scope of this PR.

@gammazero
Copy link
Contributor

I don't think the 2Q will be more efficient since CIDs are reportedly added 3 times to the queue, so we don't benefit from the 1-hit feature of 2Q, and it costs extra memory.

The 2Q should help with CIDs that are used repeatedly across different dags, those used more than the 3x duplications of most CIDs, as these will get promoted the high frequency queue. However, without testing I am not sure this how often some some CID are provided very frequently compared to all the rest, so it is possible 2Q only uses more memory without benefit.

I think for now the current solution can be merged. Additional enhancements can be attempted later.

@guillaumemichel guillaumemichel merged commit bace665 into main Apr 24, 2025
13 checks passed
@guillaumemichel guillaumemichel deleted the provider-queue-lru-deduplication branch April 24, 2025 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants