Description
We discussed that ids are not handled consistently in Haystack when a component updates metadata, for example LLMMetadataExtractor. We discussed this PR and its implications with @tstadel @sjrl @ju-gu.
We agreed that components that don’t change the content should not generate a new id, with the exception of DocumentCleaner, which has a `keep_id` parameter with the default value `False`. In other words, if a component only updates the metadata of documents, the document ids should remain unchanged in the output.
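To illustrate the agreed behavior, here is a minimal sketch. `Doc` and `add_metadata` are hypothetical stand-ins made up for this example; they are not Haystack's `Document` class or any existing component.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict


@dataclass(frozen=True)
class Doc:
    """Simplified stand-in for a document; NOT Haystack's Document class."""

    id: str
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def add_metadata(doc: Doc, extra: Dict[str, Any]) -> Doc:
    """Return a copy with merged metadata; id and content are carried over unchanged."""
    return replace(doc, meta={**doc.meta, **extra})


original = Doc(id="abc123", content="Haystack is a framework.", meta={"source": "readme"})
updated = add_metadata(original, {"language": "en"})

# Only the metadata differs; the id is the same as before.
print(updated.id == original.id)
```

The point of the sketch is that a metadata-only component copies the existing id instead of letting a new one be derived from the changed fields.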
To enable more customization in how ids are generated for newly initialized documents, we agreed that there are three options:
- Adding `id_hash_keys` to all converters
- Adding a new component just before the DocumentWriter
- Adding a new parameter to the DocumentWriter that enables generating new document ids based on `id_hash_keys`
The third option is preferred.
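As a rough sketch of what id generation based on `id_hash_keys` could look like: the `generate_id` helper, its signature, and the choice of SHA-256 are assumptions made for illustration only. This is not Haystack's implementation and not a proposed DocumentWriter API.

```python
import hashlib
from typing import Any, Dict, List


def generate_id(fields: Dict[str, Any], id_hash_keys: List[str]) -> str:
    """Hypothetical helper: hash only the selected fields, in the order given."""
    parts = [f"{key}={fields.get(key)!r}" for key in id_hash_keys]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


doc = {"content": "Haystack is a framework.", "meta": {"source": "readme"}}
doc_with_new_meta = {**doc, "meta": {"source": "readme", "language": "en"}}

# Hashing only the content keeps the id stable when metadata changes.
id_by_content = generate_id(doc, ["content"])
print(generate_id(doc_with_new_meta, ["content"]) == id_by_content)

# Including "meta" in the hash keys makes the id sensitive to metadata changes.
print(generate_id(doc_with_new_meta, ["content", "meta"]) == generate_id(doc, ["content", "meta"]))
```

With a parameter like this on the writer, users would decide which fields define document identity at write time, rather than configuring it on every converter.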
In addition, we discussed that we should not use the embedding field for document id generation, but we currently use it here: https://github.com/deepset-ai/haystack/blob/main/haystack/dataclasses/document.py#L117