provider: deduplicate cids in queue #910

guillaumemichel · 2025-04-16T07:50:05Z

Supersedes #909

We cannot use an LRU cache as a drop-in replacement for the internal buffer queue. When an item is read from the internal buffer queue, it must stay in the LRU cache so that later duplicates are still detected. We also don't want to clear the LRU cache every time the queue is persisted to the datastore.

Unfortunately, this means it is necessary to keep additional state alongside the queue. Note that the LRU cache size can be reduced if deemed too large.
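As an illustration only (this is not the code from this PR), the idea of keeping a deduplication cache separate from the buffer queue can be sketched as below. The names `dedupQueue`, `newDedupQueue`, `Enqueue` and `Dequeue` are hypothetical, the LRU is hand-rolled on top of `container/list` instead of using a library, and persistence to the datastore is left out entirely.

```go
package main

import "container/list"

// dedupQueue is a hypothetical FIFO queue that remembers recently enqueued
// keys in a separate, bounded LRU set. It only illustrates the idea and is
// not the implementation merged in this PR.
type dedupQueue struct {
	items []string // pending items, in FIFO order

	// LRU set used only for deduplication. Its lifetime is independent of
	// the pending items above.
	capacity int
	order    *list.List               // front = most recently seen key
	seen     map[string]*list.Element // key -> its node in order
}

func newDedupQueue(capacity int) *dedupQueue {
	return &dedupQueue{
		capacity: capacity,
		order:    list.New(),
		seen:     make(map[string]*list.Element),
	}
}

// Enqueue appends key unless it was seen recently. It reports whether the
// key was actually added to the queue.
func (q *dedupQueue) Enqueue(key string) bool {
	if el, ok := q.seen[key]; ok {
		q.order.MoveToFront(el) // refresh recentness on a cache hit
		return false
	}
	q.seen[key] = q.order.PushFront(key)
	if q.order.Len() > q.capacity {
		// Evict the least recently seen key to keep memory bounded.
		oldest := q.order.Back()
		q.order.Remove(oldest)
		delete(q.seen, oldest.Value.(string))
	}
	q.items = append(q.items, key)
	return true
}

// Dequeue removes and returns the oldest pending item. The LRU set is left
// untouched, so a key that was just dequeued is still treated as a duplicate.
func (q *dedupQueue) Dequeue() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	key := q.items[0]
	q.items = q.items[1:]
	return key, true
}
```

In this sketch, dequeuing a key does not remove it from the LRU set, which is the property that a plain LRU used as the queue itself would not give.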

Alternatively, it would be leaner not to push duplicates to the queue in the first place, so that we don't need any deduplication inside the provider queue. This is tracked in #901.

#907 doesn't need to wait on this PR

gammazero

Batch processing after removing items from the queue does some amount of deduplication, so I am not sure if this change will make a lot of difference.

We should wait until #901 is fixed, or decide this is the fix for #901.

provider/internal/queue/queue.go

guillaumemichel · 2025-04-22T11:58:13Z

> Batch processing after removing items from the queue does some amount of deduplication

#907 removes batching, so if we merge #907 without solving #901 we need deduplication in the queue directly (#910).

guillaumemichel · 2025-04-23T08:12:27Z

We discussed using a 2Q cache instead of the LRU for efficiency. After thinking about it again, I don't think the 2Q will be more efficient: CIDs are reportedly added 3 times to the queue, so we don't benefit from the 1-hit feature of 2Q, and it costs extra memory.

The current solution is expected to be more efficient than previous small batches, and approx. as efficient as previous large batches. Its memory usage is constant (worse than previous small batches, but better than previous large batches).

I suggest we merge this, since it is expected to be a gradual improvement of the current situation. We can always revisit later.

WDYT @gammazero @lidel

Ideally, I think we need a dedicated datastore for managing reprovides. For deduplication, we could either:

- check membership in the reprovide datastore
- use a bloom filter (or similar) generated from the CIDs in the datastore
  - the size of the bloom filter should be known beforehand to limit false positives (false negatives are fine, but we want to avoid false positives). Hence it may be necessary to generate new filters as the size increases, or once we can estimate the ingestion rate.

This is out of the scope of this PR.
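To make the sizing concern concrete, here is a small sketch using the standard Bloom filter approximations for bit count, hash count and false-positive rate. It is an illustration under textbook assumptions (independent, uniform hash functions), not something proposed in this PR, and the function names `bloomParams` and `falsePositiveRate` are hypothetical.

```go
package main

import "math"

// bloomParams returns the number of bits m and the number of hash functions k
// that the standard approximation suggests for n expected items and a target
// false-positive rate p (0 < p < 1):
//
//	m = -n * ln(p) / (ln 2)^2
//	k = (m / n) * ln 2
func bloomParams(n uint64, p float64) (m uint64, k uint64) {
	ln2 := math.Ln2
	m = uint64(math.Ceil(-float64(n) * math.Log(p) / (ln2 * ln2)))
	k = uint64(math.Round(float64(m) / float64(n) * ln2))
	if k < 1 {
		k = 1
	}
	return m, k
}

// falsePositiveRate estimates the false-positive rate of a filter with m bits
// and k hash functions once n items have been inserted:
//
//	p = (1 - e^(-k*n/m))^k
func falsePositiveRate(m, k, n uint64) float64 {
	return math.Pow(1-math.Exp(-float64(k)*float64(n)/float64(m)), float64(k))
}
```

For one million CIDs and a 1% target, `bloomParams` gives roughly 9.6 million bits (about 1.2 MB) and 7 hash functions. Feeding twice as many CIDs into that same filter pushes the estimated rate from `falsePositiveRate` well above 10%, which is why a filter sized for the wrong ingestion rate would need to be regenerated.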

gammazero · 2025-04-24T06:54:29Z

> I don't think the 2Q will be more efficient since CIDs are reportedly added 3 times to the queue, so we don't benefit from the 1-hit feature of 2Q, and it costs extra memory.

The 2Q should help with CIDs that are used repeatedly across different DAGs, those seen more often than the 3x duplication of most CIDs, as these will get promoted to the high-frequency queue. However, without testing I am not sure how often some CIDs are provided much more frequently than all the rest, so it is possible that 2Q only uses more memory without any benefit.

I think for now the current solution can be merged. Additional enhancements can be attempted later.

deduplication cache with additional state

56fb218

guillaumemichel requested a review from a team as a code owner April 16, 2025 07:50

guillaumemichel changed the title ~~deduplication cache with additional state~~ provider: deduplicate cids in queue Apr 16, 2025

changelog

5955d27

This was referenced Apr 16, 2025

provider: depuplicate cids in provider queue #909

Closed

provider: dedicated provide queue #907

Merged

tests: ensure kubo is providing cid only once ipfs/kubo#10784

Closed

gammazero requested changes Apr 17, 2025

provider/internal/queue/queue.go (outdated, resolved)

provider/internal/queue/queue.go (outdated, resolved)

address review

969cfb5

lidel mentioned this pull request Apr 22, 2025

Release 0.35 ipfs/kubo#10760

Open

39 tasks

guillaumemichel added 2 commits April 23, 2025 09:56

update recentness on cache hit

c745bf7

increase dedupCacheSize

968098c

guillaumemichel requested a review from gammazero April 23, 2025 08:37

gammazero approved these changes Apr 24, 2025

guillaumemichel merged commit bace665 into main Apr 24, 2025
13 checks passed

guillaumemichel deleted the provider-queue-lru-deduplication branch April 24, 2025 07:31
