Description
Discussed in #8513
Originally posted by bwbw723 November 1, 2024
I am using the WeaviateEmbeddingRetriever to query my data.
It works fine with the default class in Weaviate.
Once I switch to a class I created myself with a custom schema, I get the error below:
  File "/root/TS_ph3/00_WeaviateEmbeddingRetriever.py", line 70, in <module>
    result = query_pipeline.run({"text_embedder": {"text": query}})
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack/core/pipeline/pipeline.py", line 229, in run
    res: Dict[str, Any] = self._run_component(name, components_inputs[name])
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack/core/pipeline/pipeline.py", line 67, in _run_component
    res: Dict[str, Any] = instance.run(**inputs)
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/components/retrievers/weaviate/embedding_retriever.py", line 138, in run
    documents = self._document_store._embedding_retrieval(
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 538, in _embedding_retrieval
    return [self._to_document(doc) for doc in result.objects]
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 538, in <listcomp>
    return [self._to_document(doc) for doc in result.objects]
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 306, in _to_document
    document_data["id"] = document_data.pop("_original_id")
KeyError: '_original_id'
I checked the code and found that _to_document pops _original_id from the object's properties and uses it as the Document ID.
As a workaround, I changed document_store.py to set document_data["id"] to a generated UUID when the object has no _original_id.
With that change, I get the expected results.
I do not think data stored in Weaviate should be required to have an _original_id property.
But with the current code, retrieval fails with the KeyError above whenever _original_id is missing.
I would prefer an if statement that handles both cases.
Please correct me if I have misunderstood anything.
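
For illustration, the fallback described above could look roughly like the sketch below. It is not the actual patch: `resolve_document_id` is a hypothetical helper that does not exist in weaviate-haystack, and it works on a plain dict so it can run without a Weaviate instance. It assumes the caller can pass a fallback UUID (for example the object's own Weaviate UUID) and generates a random one otherwise.

```python
import uuid
from typing import Any, Dict, Optional


def resolve_document_id(
    properties: Dict[str, Any],
    fallback_uuid: Optional[uuid.UUID] = None,
) -> str:
    """Return a Document ID for the properties of a Weaviate object.

    Hypothetical helper for illustration; not part of weaviate-haystack.
    """
    # pop() with a default never raises KeyError, unlike pop("_original_id").
    original_id = properties.pop("_original_id", None)
    if original_id is not None:
        return str(original_id)
    # Custom schema without "_original_id": use the supplied UUID,
    # or generate a random one if none was given.
    return str(fallback_uuid if fallback_uuid is not None else uuid.uuid4())


# Object written through Haystack: "_original_id" is present and gets removed.
haystack_props = {"_original_id": "doc-1", "content": "hello"}
print(resolve_document_id(haystack_props))

# Object from a custom schema: no "_original_id", so the fallback UUID is used.
custom_props = {"title": "hello"}
object_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")
print(resolve_document_id(custom_props, object_uuid))
```

Inside _to_document, a call like this would take the place of the `document_data.pop("_original_id")` line, assuming the retrieved object exposes its UUID (for example as `data.uuid`). Reusing the object's UUID rather than a random one keeps the Document ID stable across queries.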
The packages I am using are:
haystack-ai = "2.6.1"
fastembed-haystack = "1.3.0"
weaviate-client = "^4.9.0"
weaviate-haystack = "^4.0.0"
```python
def _to_document(self, data: DataObject[Dict[str, Any], None]) -> Document:
    """
    Converts a data object read from Weaviate into a Document.
    """
    document_data = data.properties
    # The KeyError is raised on the next line. As a workaround, I set
    # document_data["id"] to a generated UUID when "_original_id" is missing.
    document_data["id"] = document_data.pop("_original_id")

    if isinstance(data.vector, List):
        document_data["embedding"] = data.vector
    elif isinstance(data.vector, Dict):
        document_data["embedding"] = data.vector.get("default")
    else:
        document_data["embedding"] = None

    if (blob_data := document_data.get("blob_data")) is not None:
        document_data["blob"] = {
            "data": base64.b64decode(blob_data),
            "mime_type": document_data.get("blob_mime_type"),
        }

    # We always delete these fields as they're not part of the Document dataclass
```