Is your feature request related to a problem? Please describe.
I am building a RAG pipeline using the OpenSearchDocumentStore as the vector store. I would like to use a custom query that filters by metadata fields in addition to using $query_embedding. In other words, I would like a retriever that performs the usual embedding search on the content together with a full-text query on metadata fields. The Haystack metadata filter (used with the embedding retriever) does not work for my use case because its filtering functionality is too limited.
Describe the solution you'd like
I am not sure about this, but one solution might be to add the search_fields functionality that was added in Haystack 1.0 but is not present in Haystack 2.0. (By the way, when I passed a search_fields argument to the OpenSearchDocumentStore in Haystack 2.0, it did not throw an exception. If search_fields is not supported, I think an exception should be thrown.) Any other solution to my problem is also welcome.
I should add that running a BM25Retriever for the full-text query and joining its results with those of an EmbeddingRetriever would not work for my use case. I would like the semantic search to run only on the document chunks that belong to a file whose name matches a text string ("Coral Gold Resources" or "CoralGoldResources" in my example below). Otherwise the search space is too large, since there are hundreds of files to search.
Also, please let me know of any workaround you recommend until the requested functionality (if you agree with it!) is productionized. For example, would the QdrantDocumentStore together with the QdrantHybridRetriever (I just came across this) work for my situation? P.S. I tried the Qdrant fastembed hybrid search (with sparse vector embeddings, specifying meta_fields_to_embed), but the results were not good, because the sparse embedding search also searches the content, whereas I want it to search only the metadata.
Describe alternatives you've considered
I have tried to use search_fields (in Haystack 2.0) and tried various custom queries but none of them worked. The below embedding-based query works as expected:
But the next one below does not retrieve any results. Here, the file name is a metadata field of the chunks; the file name is "CORALGOLDRESOURCES,LTD_05_28_2020-EX-4.1-CONSULTING AGREEMENT.md".
Additional context
I am using the following document store:
I am also using the following embedders and retriever:
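from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.utils import ComponentDevice
from haystack_integrations.components.retrievers.opensearch import OpenSearchEmbeddingRetriever

document_embedder = SentenceTransformersDocumentEmbedder(
    model=embed_model,
    device=ComponentDevice.from_str("cuda:0"),
    trust_remote_code=True,  # for embeddings like nomic-ai/nomic-embed-text-v1
    meta_fields_to_embed=meta_fields_to_embed,
)
text_embedder = SentenceTransformersTextEmbedder(model=embed_model, device=ComponentDevice.from_str("cuda:0"))
embedding_retriever = OpenSearchEmbeddingRetriever(document_store=document_store)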
The query pipeline is run as below:
result = query_pipeline.run({"text_embedder": {"text": query_}, "embedding_retriever": {"custom_query": custom_query_with_metadata_filter}})
I would like to have the below custom query return chunks where the content is semantically similar to the query and the file name (which is a metadata field of the chunks) contains the text "Coral Gold Resources" or "CoralGoldResources".
custom_query_with_metadata_filter = {
    "query": {
        "bool": {
            # must -> boolean AND over the queries in the list; should -> boolean OR over the queries in the list
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": "$query_embedding",
                            "k": 100,
                        }
                    }
                },
                {
                    "bool": {
                        "should": [
                            {"match": {"file_path": "Coral Gold Resources"}},
                            {"match": {"file_path": "CoralGoldResources"}},
                        ]
                    }
                },
            ]
        }
    }
}
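One possible reason the query above returns nothing, offered as an assumption rather than a confirmed diagnosis: if the index maps string metadata such as file_path as keyword rather than analyzed text, a match query only succeeds on the exact full value, so "CoralGoldResources" never matches "CORALGOLDRESOURCES,LTD_05_28_2020-EX-4.1-CONSULTING AGREEMENT.md". The sketch below swaps match for OpenSearch wildcard clauses with case_insensitive set, and moves them into the bool filter so they restrict the hits without affecting scoring. The helper name build_filtered_knn_query and its parameters are hypothetical; only the query DSL keys and the $query_embedding placeholder come from OpenSearch and the issue text.

```python
from typing import Any, Dict, List


def build_filtered_knn_query(
    name_patterns: List[str],
    k: int = 100,
    meta_field: str = "file_path",  # assumed to be a top-level keyword field
) -> Dict[str, Any]:
    """Build a custom_query body: k-NN on `embedding`, restricted to
    documents whose metadata field contains any of the given substrings."""
    # One case-insensitive substring clause per pattern.
    name_clauses = [
        {"wildcard": {meta_field: {"value": f"*{pattern}*", "case_insensitive": True}}}
        for pattern in name_patterns
    ]
    return {
        "query": {
            "bool": {
                # Scoring clause: the placeholder is filled in by the retriever.
                "must": [
                    {"knn": {"embedding": {"vector": "$query_embedding", "k": k}}}
                ],
                # Non-scoring restriction: at least one pattern must match.
                "filter": [
                    {"bool": {"should": name_clauses, "minimum_should_match": 1}}
                ],
            }
        }
    }


custom_query_with_metadata_filter = build_filtered_knn_query(
    ["CoralGoldResources", "Coral Gold Resources"]
)
```

Two caveats on this sketch. Leading-wildcard patterns can be slow on large indices. Also, when a k-NN clause is combined with a boolean filter, the filter may be applied after the nearest neighbours are selected, so with hundreds of files a larger k may be needed to leave enough matching chunks.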
Thanks in advance for taking the time to look into this.
I'm just following up on this. I see it has been tagged as a feature request. However, in the "Describe the solution you'd like" section of my request, I also asked whether a workaround could be made to work. I would appreciate it if someone could get back to me about a short-term solution while we wait for a response to, and resolution of, the feature request.
In general, I must say you guys have been very responsive. Thanks very much in advance.