provider: deduplicate cids in queue #910
Conversation
We discussed using a 2Q cache instead of the LRU for efficiency. After thinking about it again, I don't think 2Q will be more efficient: CIDs are reportedly added to the queue about 3 times, so we don't benefit from the frequency-promotion feature of 2Q, and it costs extra memory. The current solution is expected to be more efficient than the previous small batches, and approximately as efficient as the previous large batches. Its memory usage is constant (worse than the previous small batches, but better than the previous large batches).

I suggest we merge this, since it is expected to be an incremental improvement over the current situation. We can always revisit later. WDYT @gammazero @lidel

Ideally, I think we need a dedicated datastore for managing reprovides. For deduplication, we could either:

This is out of the scope of this PR.
The 2Q should help with CIDs that are used repeatedly across different DAGs, i.e. those used more often than the ~3x duplication of most CIDs, as these will get promoted to the high-frequency queue. However, without testing I am not sure how often some CIDs are provided very frequently compared to all the rest, so it is possible 2Q only uses more memory without benefit. I think the current solution can be merged for now. Additional enhancements can be attempted later.
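For context, here is a minimal sketch of what a 2Q-based dedup filter could look like, assuming `hashicorp/golang-lru/v2` and `go-cid`; `seenFilter` and its size parameter are illustrative names only, not code from this PR:

```go
package dedup

import (
	lru "github.com/hashicorp/golang-lru/v2"
	"github.com/ipfs/go-cid"
)

// seenFilter answers "have we recently enqueued this CID?" using a 2Q cache.
// 2Q keeps a small "recent" queue and promotes keys seen more than once to a
// "frequent" queue, which only pays off if some CIDs recur far more often
// than the typical ~3 duplicates discussed above.
type seenFilter struct {
	cache *lru.TwoQueueCache[cid.Cid, struct{}]
}

func newSeenFilter(size int) (*seenFilter, error) {
	c, err := lru.New2Q[cid.Cid, struct{}](size)
	if err != nil {
		return nil, err
	}
	return &seenFilter{cache: c}, nil
}

// seen reports whether c was already recorded, and records it either way.
func (f *seenFilter) seen(c cid.Cid) bool {
	if f.cache.Contains(c) {
		return true
	}
	f.cache.Add(c, struct{}{})
	return false
}
```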
Supersedes #909
We cannot use an LRU cache as a drop-in replacement for the internal buffer queue. When an item is read from the internal buffer queue, we don't want it removed from the LRU cache, since it must stay there for deduplication. We also don't want to clear the LRU cache every time the queue is persisted to the datastore.
Unfortunately, it is necessary to keep additional state. Note that the LRU cache size can be reduced if it is deemed too large.
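As a rough illustration of this design (a simplified sketch assuming `hashicorp/golang-lru/v2`, not the actual provider queue code), the LRU sits next to the in-memory buffer, is consulted only on enqueue, and is left untouched by dequeuing or persisting:

```go
package dedupqueue

import (
	lru "github.com/hashicorp/golang-lru/v2"
	"github.com/ipfs/go-cid"
)

// dedupQueue is a simplified model of a provider queue with a bounded LRU
// cache used only to skip duplicate enqueues. Reading from the buffer does
// not evict from the cache, so recently seen CIDs keep being deduplicated
// even after they have been dequeued or the buffer has been persisted.
type dedupQueue struct {
	buffer []cid.Cid                     // in-memory buffer queue (stand-in)
	dedup  *lru.Cache[cid.Cid, struct{}] // constant-size dedup state
}

func newDedupQueue(cacheSize int) (*dedupQueue, error) {
	c, err := lru.New[cid.Cid, struct{}](cacheSize)
	if err != nil {
		return nil, err
	}
	return &dedupQueue{dedup: c}, nil
}

// enqueue adds c unless it was seen within the cache window.
func (q *dedupQueue) enqueue(c cid.Cid) {
	if q.dedup.Contains(c) {
		return // duplicate, drop it
	}
	q.dedup.Add(c, struct{}{})
	q.buffer = append(q.buffer, c)
}

// dequeue pops the next CID without touching the dedup cache.
func (q *dedupQueue) dequeue() (cid.Cid, bool) {
	if len(q.buffer) == 0 {
		return cid.Undef, false
	}
	c := q.buffer[0]
	q.buffer = q.buffer[1:]
	return c, true
}
```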
Alternatively, it would be leaner not to push duplicates into the queue in the first place, so that we don't need any deduplication inside the provider queue. This is tracked in #901.
#907 doesn't need to wait on this PR