Support zero-downtime vectorization on PgvectorDocumentStore #1696

antrix · 2025-05-03T12:54:17Z

Is your feature request related to a problem? Please describe.

In our current use case, we always need to recreate the table when running vectorization and inserting new documents into PgvectorDocumentStore. While this operation is running, any RAG pipelines that depend on the affected document store are effectively "offline": we either have to turn those pipelines off or accept partial data.

Describe the solution you'd like

It would be great to have an option to "swap" document stores. I am imagining a process like this: when we need to run vectorization, we create a new document store as a "temp" store and insert all new documents into it. When it is ready, we ask Haystack to switch the real store with the temp store, and then delete the temp store (which by then holds the old data). Behind the scenes, this is essentially a set of PG table renames done atomically.
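To make the "atomic rename" idea concrete, here is a rough sketch of what the swap could look like at the SQL level. This is not an existing Haystack API: `build_swap_statements` and `swap_tables` are hypothetical helpers, the table names are placeholders, and `conn` is assumed to be a DB-API style connection (for example from psycopg) that is not in autocommit mode. It relies on PostgreSQL DDL being transactional, so the renames and the drop become visible together at commit.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_swap_statements(live: str, staging: str) -> List[str]:
    """Return the statements that replace `live` with `staging`.

    Assumes both tables are in the same schema and are referenced by
    unqualified names. The "<live>_old" name is a hypothetical choice.
    """
    old = f"{live}_old"
    return [
        # Move the current table out of the way.
        f"ALTER TABLE {quote_ident(live)} RENAME TO {quote_ident(old)}",
        # Promote the freshly populated table to the live name.
        f"ALTER TABLE {quote_ident(staging)} RENAME TO {quote_ident(live)}",
        # Drop the previous data once the new table is in place.
        f"DROP TABLE {quote_ident(old)}",
    ]


def swap_tables(conn, live: str, staging: str) -> None:
    """Run the swap in a single transaction on a DB-API connection."""
    cur = conn.cursor()
    try:
        for stmt in build_swap_statements(live, staging):
            cur.execute(stmt)
        conn.commit()  # renames and drop take effect together
    except Exception:
        conn.rollback()  # leave the live table untouched on failure
        raise
```

Two caveats with this sketch: `ALTER TABLE ... RENAME` takes an exclusive lock on the table, so concurrent readers would wait for the duration of the swap, and renaming a table does not rename its indexes, so the index names of the temp store would carry over to the live table.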

Not sure what the API would look like, to be honest!

Describe alternatives you've considered

Given the current implementation of PgvectorDocumentStore, I didn't find any way to swap two stores.

Additional context

An alternative we could try would be to just insert documents into the store with the correct overwrite policy. The challenge is that in our setup, we don't have stable ids for the documents, so we can't reliably de-dup new inserts into the store.
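For illustration, one way the missing-ids problem is sometimes worked around is to derive the id from the document itself. The sketch below is only that, and it assumes the text and any identifying metadata are byte-for-byte stable between runs, which may not hold in our setup. `content_id` is a hypothetical helper, not part of Haystack.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def content_id(content: str, meta: Optional[Dict[str, Any]] = None) -> str:
    """Derive a deterministic id from a document's text and metadata.

    The payload is serialized with sorted keys so that the order of the
    metadata fields does not change the resulting hash.
    """
    payload = json.dumps(
        {"content": content, "meta": meta or {}},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The resulting value could be passed as the `id` when building each `Document`, so that re-inserting unchanged content overwrites the existing row. It would not remove rows for documents that have disappeared from the source, so it is not a full replacement for the swap.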

antrix added the feature request Ideas to improve an integration label May 3, 2025
