The Rerank model in RAG needs to support independent score_threshold and top_k #11068

Closed
4 of 5 tasks
hustyichi opened this issue Nov 25, 2024 · 9 comments · Fixed by #11132
Assignees
Labels
💪 enhancement New feature or request 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database.

Comments

@hustyichi
Contributor

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

Currently, the retrieval stage and the re-ranking stage share the score_threshold and top_k parameters, which causes the following problems:

  1. The retrieval model and the re-ranking model produce scores with different distributions and therefore need different thresholds, which the current design cannot support;
  2. To preserve as much information as possible, the retrieval stage generally recalls a large amount of text, which is then compressed to a relatively small number of chunks after re-ranking to fit the large model's input context. With a shared top_k, setting it too small prevents the retrieval stage from recalling enough content and hurts recall, while setting it too large overflows the large model's context;
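A minimal sketch of what decoupled parameters could look like. Everything here (the names retrieve_top_k, rerank_top_n, the callables) is a hypothetical illustration, not Dify's actual API:

```python
# Illustrative sketch only: parameter and function names are hypothetical,
# not Dify's actual API.

def retrieve_then_rerank(
    query,
    vector_search,         # callable: (query, top_k) -> [(chunk, score), ...]
    rerank,                # callable: (query, chunks) -> [(chunk, score), ...]
    retrieve_top_k=50,     # recall widely during retrieval
    retrieve_threshold=0.3,  # cutoff on the retrieval score scale
    rerank_top_n=8,        # compress to fit the LLM context window
    rerank_threshold=0.7,  # cutoff on the (different) reranker score scale
):
    # Stage 1: broad recall with retrieval-specific cutoffs.
    candidates = [
        chunk for chunk, score in vector_search(query, retrieve_top_k)
        if score >= retrieve_threshold
    ]
    # Stage 2: rerank and compress with rerank-specific cutoffs.
    ranked = rerank(query, candidates)
    return [chunk for chunk, score in ranked if score >= rerank_threshold][:rerank_top_n]
```

Separate thresholds matter because a vector-similarity score of, say, 0.3 and a cross-encoder rerank score of 0.3 are not comparable quantities.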

2. Additional context or comments

No response

3. Can you help us with this feature?

  • I am interested in contributing to this feature.
@dosubot dosubot bot added the 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database. label Nov 25, 2024
@nadirvishun

Many people have been giving feedback about this for a long time (e.g. #7187), and the maintainers are aware of it, but have been slow to change it.

@laipz8200
Member

I’ll ping our team about this feature request.

@laipz8200
Member

Hi! The top_k and score_threshold settings in Workflow, Chatbot, or Agent do not override the configurations in the knowledge base. For your mentioned scenario, you can configure a higher value in the knowledge base, allowing Dify to retrieve more data during the search phase. Then, set lower values in Workflow, Chatbot, or Agent. This way, Dify will rerank the results and discard unnecessary ones after retrieval.
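The workaround described here amounts to two layers of settings. A sketch with dictionaries whose keys and values are illustrative only, not Dify's actual configuration schema:

```python
# Illustrative only: not Dify's real configuration schema.

# Knowledge-base level: cast a wide net during retrieval.
knowledge_base_retrieval = {
    "top_k": 50,             # recall many chunks inside each knowledge base
    "score_threshold": 0.2,  # permissive cutoff on retrieval scores
}

# Workflow / Chatbot / Agent level: compress after reranking.
app_retrieval = {
    "top_k": 8,              # final number of chunks passed to the LLM
    "score_threshold": 0.6,  # stricter cutoff applied after reranking
}
```

The key point is the asymmetry: a large top_k at the knowledge-base level, a small top_k at the app level.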

@laipz8200 laipz8200 self-assigned this Dec 7, 2024
@nadirvishun

nadirvishun commented Dec 7, 2024

I reviewed the process again, and there are two settings here:

  • Settings in the knowledge base, where its parameter TopK participates in both the first-stage vector retrieval and the rerank retrieval of the knowledge base itself.
  • Retrieval settings in chatFlow's knowledge base, where its parameter TopK is only used to rerank, once more, the documents aggregated from the multiple knowledge-base retrievals. Its value determines the maximum number of fragments passed to the large model.


Suppose I have 3 knowledge bases:

  • According to the current process, the knowledge-base TopK is 50 and its rerank also returns 50 records (wasted effort, since everything will be reranked again afterwards), while the retrieval setting in chatFlow is 8. Aggregated, that is 50+50+50=150 candidates, and the final rerank picks 8 out of 150. The final rerank handles a lot of data, which should hurt speed.
  • If we instead distinguish a search TopK of 50 from a rerank TopN of 8 in the knowledge base, with the retrieval setting in chatFlow at 8, the aggregate is 8+8+8=24, and the final rerank picks 8 out of 24. This would obviously be better.

I don't know if what I wrote above is accurate.
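The candidate counts in the two schemes above come down to simple arithmetic (numbers taken from the comment, purely illustrative):

```python
num_kbs = 3

# Current behaviour: each knowledge base retrieves and reranks with a
# shared top_k of 50, and the chatFlow rerank then picks 8 from the pool.
shared_top_k = 50
pool_current = num_kbs * shared_top_k    # 3 * 50 = 150 candidates

# Proposed behaviour: each knowledge base still retrieves 50 candidates
# but reranks down to a separate top_n of 8 before aggregation.
rerank_top_n = 8
pool_proposed = num_kbs * rerank_top_n   # 3 * 8 = 24 candidates

print(pool_current, pool_proposed)  # 150 24
```

Since cross-encoder reranking scales roughly linearly with the number of candidates, shrinking the pool from 150 to 24 directly reduces the cost of the final rerank.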

@laipz8200
Member

You are correct, but this would fill the knowledge base with too many configuration options, so we don't have any plans to change it right now.

@nadirvishun

@laipz8200
At least you can remerge the pull request: #11305

@laipz8200
Member

> @laipz8200 At least you can remerge the pull request: #11305

We won't merge it because its design is not as we expect, but you can still keep it in your version if you want.

@laipz8200 laipz8200 closed this as not planned Dec 9, 2024
@k1endn

k1endn commented Dec 20, 2024

> (quoting @nadirvishun's comment of Dec 7, 2024 above in full)

I have also considered what you mentioned. In the current ChatFlow process, we rerank twice: first inside each knowledge base and then over the chunks aggregated from multiple datasets. The first rerank is not necessary.
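A sketch of that single-rerank flow: retrieve from each knowledge base without a per-KB rerank, then rerank once over the aggregate. All function names here are hypothetical, not Dify's internals:

```python
def multi_kb_retrieve(query, kb_searches, rerank, per_kb_top_k=50, final_top_n=8):
    """Retrieve from several knowledge bases, then rerank exactly once.

    kb_searches: list of callables (query, top_k) -> [chunk, ...]
    rerank:      callable (query, chunks) -> chunks sorted best-first
    """
    # Stage 1: plain vector retrieval per knowledge base, no per-KB rerank.
    pool = []
    for search in kb_searches:
        pool.extend(search(query, per_kb_top_k))
    # Stage 2: a single rerank over the aggregated pool, then truncate.
    return rerank(query, pool)[:final_top_n]
```

Skipping the per-KB rerank avoids redundant rerank model calls whose output is discarded anyway, since every candidate is reranked again in the aggregation step.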

@laipz8200
Member

@hustyichi Thank you for your response. We understand the issue but believe this approach prevents adding too many settings to the knowledge base feature, keeping it easy to use. I have forwarded this issue to our product team, and it may change in the future.

4 participants