_original_id should not be required in weaviate with Haystack 2.6 #1171

bwbw723 · 2024-11-11T10:06:30Z

I am using the WeaviateEmbeddingRetriever to query my data.
It works fine with the default class in Weaviate.
Once I switch to a class I created myself with a customized schema, I get the following error:

  File "/root/TS_ph3/00_WeaviateEmbeddingRetriever.py", line 70, in <module>
    result = query_pipeline.run({"text_embedder": {"text": query}})
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack/core/pipeline/pipeline.py", line 229, in run
    res: Dict[str, Any] = self._run_component(name, components_inputs[name])
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack/core/pipeline/pipeline.py", line 67, in _run_component
    res: Dict[str, Any] = instance.run(**inputs)
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/components/retrievers/weaviate/embedding_retriever.py", line 138, in run
    documents = self._document_store._embedding_retrieval(
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 538, in _embedding_retrieval
    return [self._to_document(doc) for doc in result.objects]
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 538, in <listcomp>
    return [self._to_document(doc) for doc in result.objects]
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 306, in _to_document
    document_data["id"] = document_data.pop("_original_id")
KeyError: '_original_id'

I checked the code and found that _to_document needs to read _original_id from the data and set it as the Document ID.
I updated the code in document_store.py to set document_data["id"] to a generated UUID if the data does not have one.
With this change, the expected results are returned.
I do not think data in Weaviate should be forced to have an _original_id property.
But with the current code, an error is raised if _original_id is missing.
I would prefer an if statement to handle both cases.
Please correct me if I have misunderstood anything.
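
To make the proposed if statement concrete, here is a minimal sketch of the fallback logic. It is not the integration's actual code: resolve_document_id and its weaviate_uuid parameter are hypothetical names used only for illustration, and the assumption is that the caller would pass in the UUID Weaviate assigned to the object, if it has one at hand.

```python
import uuid
from typing import Any, Dict, Optional


def resolve_document_id(properties: Dict[str, Any], weaviate_uuid: Optional[Any] = None) -> str:
    """Hypothetical helper: choose a Document id from the properties of a Weaviate object.

    Order of preference:
    1. the "_original_id" property, if the object has one;
    2. the UUID supplied by the caller (assumed to be the one Weaviate assigned);
    3. a freshly generated random UUID.
    """
    # pop() with a default avoids the KeyError and still removes the field when present.
    original_id = properties.pop("_original_id", None)
    if original_id is not None:
        return str(original_id)
    if weaviate_uuid is not None:
        return str(weaviate_uuid)
    return str(uuid.uuid4())


# Object written by Haystack: the stored original id is used.
print(resolve_document_id({"_original_id": "doc-1", "content": "hello"}))
# Object from a custom schema without the field: falls back to the supplied UUID.
print(resolve_document_id({"content": "hello"}, weaviate_uuid="1b4e28ba-2fa1-11d2-883f-0016d3cca427"))
```

Note that a randomly generated id changes on every retrieval, so falling back to the UUID Weaviate already assigned is probably the more stable choice when it is available.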

The packages I am using are:
haystack-ai = "2.6.1"
fastembed-haystack = "1.3.0"
weaviate-client = "^4.9.0"
weaviate-haystack = "^4.0.0"

    def _to_document(self, data: DataObject[Dict[str, Any], None]) -> Document:
        """
        Converts a data object read from Weaviate into a Document.
        """
        document_data = data.properties
        # The KeyError is raised on the next line. As a workaround, I set document_data["id"] to a generated UUID if the data does not have one.
        document_data["id"] = document_data.pop("_original_id")
        if isinstance(data.vector, List):
            document_data["embedding"] = data.vector
        elif isinstance(data.vector, Dict):
            document_data["embedding"] = data.vector.get("default")
        else:
            document_data["embedding"] = None

        if (blob_data := document_data.get("blob_data")) is not None:
            document_data["blob"] = {
                "data": base64.b64decode(blob_data),
                "mime_type": document_data.get("blob_mime_type"),
            }

        # We always delete these fields as they're not part of the Document dataclass


anakin87 · 2024-11-15T14:48:10Z

The rationale behind this field is explained here:

haystack-core-integrations/integrations/weaviate/src/haystack_integrations/document_stores/weaviate/document_store.py

Lines 276 to 278 in 67e08d0

    # Weaviate forces a UUID as an id.
    # We don't know if the id of our Document is a UUID or not, so we save it on a different field
    # and let Weaviate a UUID that we're going to ignore completely.

This is done to provide a robust default to users who don't need serious customization.

For simplicity, you can include this field in your collection configuration:

haystack-core-integrations/integrations/weaviate/src/haystack_integrations/document_stores/weaviate/document_store.py

Line 40 in 67e08d0

{"name": "_original_id", "dataType": ["text"]},

Does this create problems?
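
As a rough sketch of that suggestion, the snippet below adds the _original_id property to a custom collection configuration if it is missing. The helper with_original_id, the class name MyArticles, and the title property are hypothetical and only serve the example; the property definition itself is the one quoted above.

```python
from copy import deepcopy
from typing import Any, Dict

# Property definition quoted from the integration's default collection properties.
ORIGINAL_ID_PROPERTY = {"name": "_original_id", "dataType": ["text"]}


def with_original_id(collection_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the settings that includes _original_id."""
    settings = deepcopy(collection_settings)  # leave the caller's dict untouched
    properties = settings.setdefault("properties", [])
    if not any(prop.get("name") == "_original_id" for prop in properties):
        properties.append(dict(ORIGINAL_ID_PROPERTY))
    return settings


# Hypothetical custom schema, for illustration only.
custom_settings = {
    "class": "MyArticles",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "title", "dataType": ["text"]},
    ],
}

settings = with_original_id(custom_settings)
print([prop["name"] for prop in settings["properties"]])
```

The resulting dictionary is meant to be passed as the collection settings when creating the document store. Objects that were written before the property existed would presumably still lack a value for it, so they may need to be re-indexed or updated.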

bwbw723 mentioned this issue Nov 11, 2024

_original_id should not be required in weaviate with Haystack 2.6 deepset-ai/haystack#8523

Closed

anakin87 added the integration:weaviate label Nov 12, 2024

anakin87 added the information-needed Information needed from the user label Nov 15, 2024

github-actions bot added the Stale label Dec 16, 2024

github-actions bot closed this as not planned Dec 27, 2024
