Pgvector hybrid #576

Closed
wants to merge 17 commits into from
Conversation

jlonge4
Contributor

@jlonge4 jlonge4 commented Mar 13, 2024

I was hoping to implement hybrid search within the pgvector integration and use RRF (Reciprocal Rank Fusion) to merge the results. One piece is still missing: the user query itself (commented out on line 571). I didn't want to open an issue/enhancement without at least providing a starting point. Let me know what you think @anakin87 @vblagoje @masci

Inspiration -> https://github.com/pgvector/pgvector-python/blob/master/examples/hybrid_search_rrf.py
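
For readers unfamiliar with RRF, here is a minimal sketch of the merging step in plain Python. It is not the code from this PR: the function name `reciprocal_rank_fusion`, the document ids, and the constant `k=60` (a commonly used default) are assumptions made for illustration. Each document scores the sum of `1 / (k + rank)` over every ranked list it appears in.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Tuple[Hashable, float]]:
    """Merge several ranked lists of document ids with Reciprocal Rank Fusion.

    Ranks start at 1. k=60 is an assumed default, not a value from the PR.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))


# Hypothetical results from a vector search and a keyword search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that appear near the top of both lists (here `doc_b` and `doc_a`) end up ahead of documents that appear in only one list.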

@jlonge4 jlonge4 requested a review from a team as a code owner March 13, 2024 01:01
@jlonge4 jlonge4 requested review from anakin87 and removed request for a team March 13, 2024 01:01
@CLAassistant

CLAassistant commented Mar 13, 2024

CLA assistant check
All committers have signed the CLA.

@anakin87
Member

Hey, @jlonge4!

Thanks for your idea/initial implementation.

I know this is a missing feature of this Document Store.
I'll investigate and get back to you in a while...

@jlonge4
Contributor Author

jlonge4 commented Mar 13, 2024

@anakin87 thanks a lot! Let me know if I can do anything further.

@github-actions github-actions bot added the type:documentation Improvements or additions to documentation label Mar 20, 2024
@anakin87
Member

Hey @jlonge4, sorry for the long wait... Tomorrow I will take a proper look!

@jlonge4
Contributor Author

jlonge4 commented Mar 21, 2024

@anakin87 no worries sir! Thank you 🙏🏼

@anakin87
Member

Before talking about hybrid retrieval, we should introduce keyword retrieval. Then we can combine vector+keyword -> hybrid retrieval.

Keyword Retrieval

TO DO (brainstorming mode 🙂)

  • add some configurations to the Document Store (language)
  • create another index on the DB as done here
  • introduce a method _keyword_retrieval in the Document Store
  • create a KeywordRetriever that accepts a query and calls the previous method

(unsure about the name)

WDYT?
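
The brainstormed steps above can be sketched as two SQL builders using PostgreSQL full-text search. This is only an illustration under stated assumptions: the table name `haystack_documents`, the columns `id` and `content`, the helper names, and the language allowlist are all hypothetical, and the `%(name)s` placeholders follow the psycopg named-parameter style. A `_keyword_retrieval` method could execute the second statement with the user query and `top_k` as parameters.

```python
# Hypothetical allowlist: the text search configuration is interpolated into
# the SQL text, so it is validated instead of being bound as a parameter.
ALLOWED_LANGUAGES = {"simple", "english", "italian", "german", "french", "spanish"}


def _validate(table_name: str, language: str) -> None:
    """Reject values that are unsafe to interpolate into SQL text."""
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"Unsupported language: {language!r}")
    if not table_name.isidentifier():
        raise ValueError(f"Invalid table name: {table_name!r}")


def keyword_index_sql(
    table_name: str = "haystack_documents", language: str = "english"
) -> str:
    """Build the statement that creates a GIN index for full-text search."""
    _validate(table_name, language)
    return (
        f"CREATE INDEX IF NOT EXISTS {table_name}_keyword_idx "
        f"ON {table_name} USING GIN (to_tsvector('{language}', content))"
    )


def keyword_query_sql(
    table_name: str = "haystack_documents", language: str = "english"
) -> str:
    """Build a ranked keyword query; expects 'query' and 'top_k' parameters."""
    _validate(table_name, language)
    return (
        f"SELECT id, content, "
        f"ts_rank_cd(to_tsvector('{language}', content), query) AS score "
        f"FROM {table_name}, plainto_tsquery('{language}', %(query)s) query "
        f"WHERE to_tsvector('{language}', content) @@ query "
        f"ORDER BY score DESC LIMIT %(top_k)s"
    )


index_statement = keyword_index_sql()
query_statement = keyword_query_sql(language="italian")
```

The index expression and the `WHERE` clause use the same `to_tsvector` call with the same language, which is what allows PostgreSQL to use the GIN index for the match.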

@jlonge4
Contributor Author

jlonge4 commented Mar 22, 2024

@anakin87 sounds like a great plan, bite-sized pieces are better! Made a few updates based on your thoughts.

@anakin87 anakin87 self-assigned this Mar 25, 2024
@anakin87
Member

Hey!

  • can you fix the linting error?
  • can you add some tests for this addition?

@jlonge4
Contributor Author

jlonge4 commented Mar 27, 2024

@anakin87 Definitely, I am gonna do some local testing (might take a couple days) but will get there!

@anakin87
Member

Take your time...
I will be off for a few days 🙂

@anakin87
Member

anakin87 commented Apr 3, 2024

Hey, I see that tests are no longer running for some reason.

I would suggest focusing on Keyword Retrieval first and then thinking about Hybrid Retrieval (maybe in another PR).

Let me know if you need any help or suggestions...

@jlonge4
Contributor Author

jlonge4 commented Apr 3, 2024

@anakin87 you are right, I believe I'll kill this PR and do as you suggested 👍🏼

@jlonge4 jlonge4 closed this Apr 5, 2024
@anakin87 anakin87 removed their assignment Apr 5, 2024
@kanenorman
Contributor

kanenorman commented Oct 17, 2024

Hi @anakin87, sorry to resurface an old issue.

Are there any plans to introduce a PgvectorHybridRetriever, or is the expectation that hybrid retrieval should be implemented using a pipeline, similar to #738? I’ve noticed that some integrations, like QdrantHybridRetriever, already have hybrid retrievers. Does the core team prefer users implement hybrid retrieval through a pipeline instead?

@anakin87
Member

Hello @kanenorman!

There is no set rule.

Our main goal is to provide users with hybrid retrieval capabilities, if available (using a Pipeline or not).

I would say that implementing a Hybrid Retriever makes sense especially when:

  • there is a significant difference in query times compared to using a hybrid retrieval Pipeline (a single optimized query in the DB versus separate queries plus Pipeline overhead)
  • the community requests this feature
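
To make the "single optimized query" case concrete, here is a sketch of what one hybrid query could look like, loosely modeled on the pgvector-python RRF example linked at the top of this thread. It is not an existing retriever: the table `haystack_documents`, the columns `id`, `content`, and `embedding`, the `'english'` configuration, and the helper `hybrid_params` are assumptions. The embedding is passed as a text literal and cast to `vector`, and `<=>` is pgvector's cosine distance operator.

```python
from typing import Any, Dict, Sequence

# Both searches and the RRF merge run inside one statement, so the database
# is hit once instead of once per retriever.
HYBRID_RRF_SQL = """
WITH semantic_search AS (
    SELECT id, RANK() OVER (ORDER BY embedding <=> %(embedding)s::vector) AS rank
    FROM haystack_documents
    ORDER BY embedding <=> %(embedding)s::vector
    LIMIT %(candidates)s
),
keyword_search AS (
    SELECT id,
           RANK() OVER (
               ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
           ) AS rank
    FROM haystack_documents, plainto_tsquery('english', %(query)s) query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
    LIMIT %(candidates)s
)
SELECT
    COALESCE(semantic_search.id, keyword_search.id) AS id,
    COALESCE(1.0 / (%(k)s + semantic_search.rank), 0.0)
      + COALESCE(1.0 / (%(k)s + keyword_search.rank), 0.0) AS score
FROM semantic_search
FULL OUTER JOIN keyword_search ON semantic_search.id = keyword_search.id
ORDER BY score DESC
LIMIT %(top_k)s
"""


def hybrid_params(
    query: str,
    embedding: Sequence[float],
    top_k: int = 5,
    candidates: int = 20,
    k: int = 60,
) -> Dict[str, Any]:
    """Build the named parameters for HYBRID_RRF_SQL (hypothetical helper)."""
    return {
        "query": query,
        # pgvector accepts a '[x,y,...]' text literal cast to vector.
        "embedding": "[" + ",".join(str(float(x)) for x in embedding) + "]",
        "top_k": top_k,
        "candidates": candidates,
        "k": k,
    }


params = hybrid_params("hybrid search", [0.1, 0.2])
```

With a psycopg cursor this could be run as `cursor.execute(HYBRID_RRF_SQL, params)`. Whether it is meaningfully faster than a Pipeline that runs two retrievers and joins their results would need to be measured on real data.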

Labels
integration:pgvector type:documentation Improvements or additions to documentation

4 participants