Handle Document ids consistently and enable custom id_hash_keys #9561

Open
@julian-risch

Description

We discussed that ids are not handled consistently in Haystack when a component, for example LLMMetadataExtractor, updates metadata. We discussed this PR and its implications with @tstadel, @sjrl, and @ju-gu.

We agreed that components that don’t change the content should not generate a new id, with the exception of DocumentCleaner, which has a keep_id parameter with the default value false. In other words, if a component only updates the metadata of documents, the document ids should remain unchanged in the output.
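As a rough illustration of the agreed behavior, here is a minimal sketch of a metadata-only update that carries the id over unchanged. `Doc` and `add_metadata` are hypothetical stand-ins written for this example; they are not Haystack's `Document` class or any existing component.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Haystack's Document class."""

    id: str
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def add_metadata(doc: Doc, extra: dict[str, Any]) -> Doc:
    """Return a copy with merged metadata; id and content are carried over as-is."""
    return replace(doc, meta={**doc.meta, **extra})


original = Doc(id="abc123", content="Haystack is a framework.", meta={"source": "readme"})
updated = add_metadata(original, {"language": "en"})

# Only the metadata differs between the two documents.
print(updated.id == original.id)
print(updated.meta)
```

The point of the sketch is that the id is copied explicitly rather than recomputed from the new field values, so a metadata change cannot alter it.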

To enable more customization of how ids are generated for newly initialized documents, we agreed that there are three options:

  • Adding id_hash_keys to all converters
  • Adding a new component just before the DocumentWriter
  • Adding a new parameter to the DocumentWriter that enables generating new document ids based on id_hash_keys

The third option is preferred.
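To make the idea of id_hash_keys concrete, the following is one possible sketch of id generation restricted to selected fields. The function name `generate_id`, the `"meta.<key>"` naming scheme, and the use of JSON plus SHA-256 are assumptions made for this example, not the planned DocumentWriter parameter or its implementation.

```python
import hashlib
import json
from typing import Any, Optional


def generate_id(content: Optional[str], meta: dict[str, Any], id_hash_keys: list[str]) -> str:
    """Hypothetical helper: hash only the fields named in id_hash_keys.

    "content" selects the document text; "meta.<key>" selects one metadata entry.
    """
    selected: dict[str, Any] = {}
    for key in id_hash_keys:
        if key == "content":
            selected[key] = content
        elif key.startswith("meta."):
            selected[key] = meta.get(key[len("meta."):])
        else:
            raise ValueError(f"Unsupported id_hash_key: {key}")
    # Sorting the keys makes the serialized payload independent of key order.
    payload = json.dumps(selected, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = generate_id("same text", {"file_path": "a.txt"}, ["content"])
b = generate_id("same text", {"file_path": "b.txt"}, ["content"])
c = generate_id("same text", {"file_path": "b.txt"}, ["content", "meta.file_path"])

print(a == b)  # both ids depend on the content only
print(a == c)  # c additionally depends on meta.file_path
```

With such a scheme, two documents with identical text but different source files can be made to collide or stay distinct depending on which keys the user selects.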
In addition, we discussed that we should not use the embedding field for document id generation, but we currently use it here: https://github.com/deepset-ai/haystack/blob/main/haystack/dataclasses/document.py#L117
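A small sketch of why hashing the embedding is a problem: if the embedding is part of the hashed data, embedding an existing document changes its id. Both functions below are hypothetical and only contrast the two approaches; they do not reproduce the code in document.py.

```python
import hashlib
from typing import Optional


def id_with_embedding(content: str, embedding: Optional[list[float]]) -> str:
    """Hypothetical id function that includes the embedding in the hashed data."""
    return hashlib.sha256(f"{content}{embedding}".encode("utf-8")).hexdigest()


def id_without_embedding(content: str) -> str:
    """Hypothetical id function that depends on the content only."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


text = "Haystack is a framework."

# The same document before and after an embedder has filled in the embedding field.
unstable_before = id_with_embedding(text, None)
unstable_after = id_with_embedding(text, [0.1, 0.2, 0.3])

# The content-only id is the same in both situations because the embedding is never read.
stable_id = id_without_embedding(text)

print(unstable_before == unstable_after)
print(stable_id)
```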

Metadata

Assignees

No one assigned

    Labels

    P2: Medium priority, add to the next sprint if no P1 available
