Handle Document ids consistently and enable custom `id_hash_keys` #9561

Document ids are not handled consistently in Haystack when a component, for example LLMMetadataExtractor, updates metadata. We discussed [this PR](https://github.com/deepset-ai/haystack/pull/9553/files) and its implications with @tstadel @sjrl @ju-gu.

We agreed that components that don’t change the content should not generate a new id, with the exception of DocumentCleaner, which has a `keep_id` parameter with the default value `False`. In other words, if a component only updates the metadata of documents, the document ids should remain unchanged in the output.
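To illustrate the agreed behavior, here is a minimal sketch. It uses a simplified stand-in `Doc` dataclass and a hypothetical helper `with_extra_meta`, neither of which is part of Haystack; the point is only that a metadata-only update copies the existing id instead of recomputing it.

```python
import hashlib
from dataclasses import dataclass, field, replace
from typing import Any, Dict


@dataclass(frozen=True)
class Doc:
    """Simplified stand-in for a document (not the Haystack Document class)."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # Generate an id only when none was passed in.
        if not self.id:
            digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()
            object.__setattr__(self, "id", digest)


def with_extra_meta(doc: Doc, extra: Dict[str, Any]) -> Doc:
    """Hypothetical metadata-only update: the existing id is carried over."""
    return replace(doc, meta={**doc.meta, **extra})


original = Doc(content="Haystack is an LLM framework.", meta={"source": "readme"})
enriched = with_extra_meta(original, {"topic": "frameworks"})
```

Here `enriched.id` equals `original.id` because `replace` passes the existing id through, so the stand-in never rehashes the document.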

To enable more customization of how ids are generated for newly initialized documents, we agreed that there are three options:
- Adding `id_hash_keys` to all converters
- Adding a new component just before the DocumentWriter
- Adding a new parameter to the DocumentWriter that enables generating new document ids based on `id_hash_keys`

The third option is preferred.
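As a rough sketch of what generating ids from `id_hash_keys` could look like, the function below hashes only the selected fields. The name `generate_id`, its signature, and the convention that `"content"` refers to the document content while any other key is looked up in the metadata are assumptions for illustration, not an existing DocumentWriter API.

```python
import hashlib
import json
from typing import Any, Dict, List


def generate_id(content: str, meta: Dict[str, Any], id_hash_keys: List[str]) -> str:
    """Hypothetical id generation that hashes only the fields in id_hash_keys."""
    values: Dict[str, Any] = {}
    for key in id_hash_keys:
        # Assumption: "content" means the document content,
        # every other key is read from the metadata.
        values[key] = content if key == "content" else meta.get(key)
    # sort_keys makes the hash independent of the order of id_hash_keys.
    payload = json.dumps(values, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


keys = ["content", "source"]
id_a = generate_id("same text", {"source": "a.pdf", "page": 1}, keys)
id_b = generate_id("same text", {"source": "b.pdf", "page": 1}, keys)
id_c = generate_id("same text", {"source": "a.pdf", "page": 2}, keys)
```

With these inputs, `id_a` and `id_b` differ because `source` is part of the hash, while `id_a` and `id_c` match because `page` is not.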
In addition, we discussed that we should not use the embedding field for document id generation, but we currently use it here: https://github.com/deepset-ai/haystack/blob/main/haystack/dataclasses/document.py#L117
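The following sketch shows why hashing the embedding is problematic. Both functions are simplified stand-ins written for this illustration and do not reproduce the linked Haystack code: when the embedding is part of the hash, the id changes as soon as an embedding is attached to a document whose content is unchanged.

```python
import hashlib
from typing import List, Optional


def id_with_embedding(content: str, embedding: Optional[List[float]]) -> str:
    """Stand-in id function that includes the embedding in the hash."""
    return hashlib.sha256(f"{content}{embedding}".encode("utf-8")).hexdigest()


def id_without_embedding(content: str) -> str:
    """Stand-in id function that ignores the embedding."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


# Same content, before and after an embedder has attached a vector.
before_embedding = id_with_embedding("hello", None)
after_embedding = id_with_embedding("hello", [0.1, 0.2])
stable_id = id_without_embedding("hello")
```

In this sketch `before_embedding` and `after_embedding` differ, whereas `id_without_embedding` returns the same value no matter whether a vector has been computed.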



