The Rerank model in RAG needs to support independent score_threshold and top_k #11068

Closed
4 of 5 tasks
hustyichi opened this issue Nov 25, 2024 · 9 comments · Fixed by #11132
Assignees
Labels
💪 enhancement New feature or request 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database.

Comments

@hustyichi
Contributor

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

Currently, the retrieval stage and the re-ranking stage share the score_threshold and top_k parameters, which causes the following problems:

  1. The retrieval model and the re-ranking model produce scores with different distributions and therefore need different thresholds, which the current design cannot support;
  2. To preserve as much information as possible, the retrieval stage generally recalls a large amount of text, which is then compressed to a relatively small number of chunks after re-ranking to fit the large model's input context. With a shared top_k, setting it too small prevents the retrieval stage from recalling enough content and hurts recall, while setting it too large overflows the large model's context;
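A minimal sketch of what decoupled parameters could look like. Everything here (the names retrieve_top_k, rerank_top_n, the callables) is a hypothetical illustration, not Dify's actual API:

```python
# Illustrative sketch only: parameter and function names are hypothetical,
# not Dify's actual API.

def retrieve_then_rerank(
    query,
    vector_search,         # callable: (query, top_k) -> [(chunk, score), ...]
    rerank,                # callable: (query, chunks) -> [(chunk, score), ...]
    retrieve_top_k=50,     # recall widely during retrieval
    retrieve_threshold=0.3,  # cutoff on the retrieval score scale
    rerank_top_n=8,        # compress to fit the LLM context window
    rerank_threshold=0.7,  # cutoff on the (different) reranker score scale
):
    # Stage 1: broad recall with retrieval-specific cutoffs.
    candidates = [
        chunk for chunk, score in vector_search(query, retrieve_top_k)
        if score >= retrieve_threshold
    ]
    # Stage 2: rerank and compress with rerank-specific cutoffs.
    ranked = rerank(query, candidates)
    return [chunk for chunk, score in ranked if score >= rerank_threshold][:rerank_top_n]
```

Separate thresholds matter because a vector-similarity score of, say, 0.3 and a cross-encoder rerank score of 0.3 are not comparable quantities.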

2. Additional context or comments

No response

3. Can you help us with this feature?

  • I am interested in contributing to this feature.
@dosubot dosubot bot added the 👻 feat:rag Embedding related issue, like qdrant, weaviate, milvus, vector database. label Nov 25, 2024
@nadirvishun

Many people have been giving feedback about this for a long time (e.g. #7187), and the maintainers are aware of it, but have been slow to change it.

@laipz8200
Member

I’ll ping our team about this feature request.

@laipz8200
Member

Hi! The top_k and score_threshold settings in Workflow, Chatbot, or Agent do not override the configurations in the knowledge base. For your mentioned scenario, you can configure a higher value in the knowledge base, allowing Dify to retrieve more data during the search phase. Then, set lower values in Workflow, Chatbot, or Agent. This way, Dify will rerank the results and discard unnecessary ones after retrieval.
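The workaround described here amounts to two layers of settings. A sketch with dictionaries whose keys and values are illustrative only, not Dify's actual configuration schema:

```python
# Illustrative only: not Dify's real configuration schema.

# Knowledge-base level: cast a wide net during retrieval.
knowledge_base_retrieval = {
    "top_k": 50,             # recall many chunks inside each knowledge base
    "score_threshold": 0.2,  # permissive cutoff on retrieval scores
}

# Workflow / Chatbot / Agent level: compress after reranking.
app_retrieval = {
    "top_k": 8,              # final number of chunks passed to the LLM
    "score_threshold": 0.6,  # stricter cutoff applied after reranking
}
```

The key point is the asymmetry: a large top_k at the knowledge-base level, a small top_k at the app level.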

@laipz8200 laipz8200 self-assigned this Dec 7, 2024
@nadirvishun

nadirvishun commented Dec 7, 2024

I reviewed the process again, and there are two settings here:

  • Settings in the knowledge base, where its parameter TopK participates in both the first-stage vector retrieval and the rerank retrieval of the knowledge base itself.
  • Retrieval settings in chatFlow's knowledge base, where its parameter TopK is only used to rerank, once more, the documents aggregated from the multiple knowledge-base retrievals. Its value determines the maximum number of fragments passed to the large model.


Suppose I have 3 knowledge bases:

  • According to the current process, the knowledge-base TopK is 50 and its rerank also returns 50 records (wasted effort, since everything will be reranked again afterwards), while the retrieval setting in chatFlow is 8. Aggregated, that is 50+50+50=150 candidates, and the final rerank picks 8 out of 150. The final rerank handles a lot of data, which should hurt speed.
  • If we instead distinguish a search TopK of 50 from a rerank TopN of 8 in the knowledge base, with the retrieval setting in chatFlow at 8, the aggregate is 8+8+8=24, and the final rerank picks 8 out of 24. This would obviously be better.

I don't know if what I wrote above is accurate.
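The candidate counts in the two schemes above come down to simple arithmetic (numbers taken from the comment, purely illustrative):

```python
num_kbs = 3

# Current behaviour: each knowledge base retrieves and reranks with a
# shared top_k of 50, and the chatFlow rerank then picks 8 from the pool.
shared_top_k = 50
pool_current = num_kbs * shared_top_k    # 3 * 50 = 150 candidates

# Proposed behaviour: each knowledge base still retrieves 50 candidates
# but reranks down to a separate top_n of 8 before aggregation.
rerank_top_n = 8
pool_proposed = num_kbs * rerank_top_n   # 3 * 8 = 24 candidates

print(pool_current, pool_proposed)  # 150 24
```

Since cross-encoder reranking scales roughly linearly with the number of candidates, shrinking the pool from 150 to 24 directly reduces the cost of the final rerank.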

@laipz8200
Member

You are correct, but this would fill the knowledge base with too many configuration options, so we don't have any plans to change it right now.

@nadirvishun

@laipz8200
At least you can remerge the pull request: #11305

@laipz8200
Member

> @laipz8200 At least you can remerge the pull request: #11305

We won't merge it because its design is not as we expect, but you can still keep it in your version if you want.

@laipz8200 laipz8200 closed this as not planned Dec 9, 2024
@k1endn

k1endn commented Dec 20, 2024

> (quoting @nadirvishun's comment of Dec 7, 2024 above in full)

I have also considered what you mentioned. In the current ChatFlow process, we rerank twice: first inside each knowledge base and then over the chunks aggregated from multiple datasets. The first rerank is not necessary.
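A sketch of that single-rerank flow: retrieve from each knowledge base without a per-KB rerank, then rerank once over the aggregate. All function names here are hypothetical, not Dify's internals:

```python
def multi_kb_retrieve(query, kb_searches, rerank, per_kb_top_k=50, final_top_n=8):
    """Retrieve from several knowledge bases, then rerank exactly once.

    kb_searches: list of callables (query, top_k) -> [chunk, ...]
    rerank:      callable (query, chunks) -> chunks sorted best-first
    """
    # Stage 1: plain vector retrieval per knowledge base, no per-KB rerank.
    pool = []
    for search in kb_searches:
        pool.extend(search(query, per_kb_top_k))
    # Stage 2: a single rerank over the aggregated pool, then truncate.
    return rerank(query, pool)[:final_top_n]
```

Skipping the per-KB rerank avoids redundant rerank model calls whose output is discarded anyway, since every candidate is reranked again in the aggregation step.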

@laipz8200
Member

@hustyichi Thank you for your response. We understand the issue but believe this approach prevents adding too many settings to the knowledge base feature, keeping it easy to use. I have forwarded this issue to our product team, and it may change in the future.

4 participants