Skip to content

feat: add Azure Search as VectorStore provider #2396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

MatheusBordin
Copy link
Contributor

@MatheusBordin MatheusBordin commented Aug 24, 2023

Add support for Azure Cognitive Search vector store based on python implementation with some improvement's:

  • Add batch indexing on embeddings call.
  • Add batch indexing on azure-search call.
  • Add support for custom attributes (on document metadata).

TODO:

  • Query types support
    • Support for vector similarity search.
    • Support for hybrid search.
    • Support for hybrid search with semantic configuration.
  • Add automatic index creation. (with semantic and similarity configuration).
  • Improve addDocuments to perform document's. upload in batch.
  • Add test's cases
    • Unit Test: Add vectors should upload at max 100 documents up a time.
    • Unit Test: Add documents should embed (generate vectors) at max 16 documents up a time.
    • Unit Test: generated and external id's
    • Unit Test: search with all kinds of query (similarity, hybrid, ...)
    • Integration Test: index should be created if doesn't exists.
    • Integration Test: upload documents.
    • Integration Test: perform a search.
    • Integration Test: perform a search with filter.
    • Integration Test: delete documents by id.
    • Integration Test: delete many documents by filter.
    • Integration Test: search operation should works with query token when default client is used.
  • Add documentation.
    • Add quick start documentation.
    • Add samples working with different kinds of query.
    • Add samples working with custom attributes.
    • Add section explaining about Admin Key vs Query Key approaches. (Thanks @glorat)

@vercel
Copy link

vercel bot commented Aug 24, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
langchainjs-docs ✅ Ready (Inspect) Visit Preview Aug 24, 2023 3:05pm

@dosubot dosubot bot added the auto:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features label Aug 24, 2023
fields: [DEFAULT_FIELD_CONTENT_VECTOR],
kNearestNeighborsCount: k,
}],
filter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like the other search methods, needs to have a top: k field passed in

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When search using vector similarity the parameter to limit the results is kNearestNeighborsCount.

const indexClient = new SearchIndexClient(endpoint, credential);

try {
await indexClient.getIndex(indexName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that even getIndex requires an admin key and will fail with a query key. So you pass in a query key, exception gets thrown, after which the createIndex call will also fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the implementation requires an Admin Key, but if you want to use the query key you can also pass the SearchClient instead of indexName, doing that I'll not validate if index exists and all works.

That is the example:

AzureSearchStore.create({
    client: new SearchClient(...),
    search: {...}
});

I think is nice to have that in the documentation and some test's to validate that approach. Make sense?

@lucasbuges
Copy link

Up!

@MatheusBordin
Copy link
Contributor Author

I'm working on unit and integrate tests and until weekend I'll make the push.

@izzymsft
Copy link
Contributor

izzymsft commented Oct 7, 2023

Hi @MatheusBordin thanks for working on this. I see that you are already working on the tests. Please let me know if I can help or support in anyway. Thanks.

@dexteresc
Copy link

Is this still being worked on? If not I can take over from where @MatheusBordin started.

@izzymsft
Copy link
Contributor

Is this still being worked on? If not I can take over from where @MatheusBordin started.

@dexteresc I had the same question last week. Let's give Matheus a few days to respond (maybe until October 19) and then maybe we can collaborate to take it to the finish line. Quite a lot of folks are looking forward to this feature being added to Langchain JS. It's almost done from what I can see so far.

@MatheusBordin
Copy link
Contributor Author

Hy guys @izzymsft @dexteresc, the last weeks was crazy here, I needed to focus and paused the development of this feature. I'm submitting all changes that I already developed and I think that we can work together to finish all.

@izzymsft
Copy link
Contributor

Thank you for sharing the update @MatheusBordin

When you see this please let me know how I can assist. I can write tests, review the code and help with examples.

You can grant me access to your repo as a collaborator or if you would prefer I can fork your repo and send you PRs to your own fork of the original langchainjs repo.

@MatheusBordin
Copy link
Contributor Author

Hey @izzymsft , I'm working right now on the remaining tests (Integration), after that I think we can work on the documentation.

Yeah, I can add you as a collaborator, validate your access on the next 5 minutes please.

@izzymsft
Copy link
Contributor

Yes, I have accepted the invite and I was able to make an edit. Thanks.

MatheusBordin@dd2f25f

@MatheusBordin
Copy link
Contributor Author

@izzymsft thanks for your help.

I have pushed a commit with integration tests implemented, unfortunately they are not passing. Its seen to be a problem with my Open AI organization, the embedding api returns timeout. I'll try to fix it tomorrow. Anyway, this implementation is alredy been used for a couple of weeks in production and all works fine, fixing the implementation tests and writing the documentation I think we are ready to release.

@izzymsft
Copy link
Contributor

izzymsft commented Oct 19, 2023

@MatheusBordin no problem

I think for now you can switch to FakeEmbeddings (a mock embedding) so that the tests are not unstable.

You can see how other integration tests in the same directory (Redis/Chroma) are doing it.

I will attempt to run it with my Azure OpenAI settings to see if it is still failing tomorrow

@MatheusBordin
Copy link
Contributor Author

Status update: I pushed the remaining tests just right now. Just missing the documentation to publish the PR.

Thanks @izzymsft for the idea of using FakeEmbeddings on integration tests, I found some issues about the mock implementation but I'll not fix that in this Pull Request because it can late the release.

@MatheusBordin MatheusBordin marked this pull request as ready for review October 19, 2023 13:00
@MatheusBordin
Copy link
Contributor Author

Status update: I write the documentation with a simple quick start that I think is enough to publish. For now I'm publishing the feature to review, lets wait...

@izzymsft
Copy link
Contributor

@MatheusBordin thanks for the update. I have made some changes to the documentation as well to add some links with more information on the vector search capabilities. Let's see if the langchain maintainers have any feedback on the PR.

Thank you for all your efforts on taking this to the finish line.

@lucasbuges
Copy link

@MatheusBordin @izzymsft @glorat @leofmarciano thanks for the update!!

@jacoblee93
Copy link
Collaborator

Ah apologies, didn't realize this was ready! Will have a look now.

@@ -887,6 +891,9 @@
"@aws-sdk/client-sagemaker-runtime": "^3.310.0",
"@aws-sdk/client-sfn": "^3.310.0",
"@aws-sdk/credential-provider-node": "^3.388.0",
"@aws-sdk/protocol-http": "^3.374.0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these AWS deps should be removed? Maybe a bad merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looks like a bad merge. I will remove that

/**
* Define metadata schema.
*
* If yout want to add custom data, use the attributes property.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typo

*/
export type AzureSearchDocumentMetadata = {
source: string;
attributes?: Array<{ key: string; value: string; }>;
Copy link
Collaborator

@jacoblee93 jacoblee93 Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this restriction as many advanced retrievers will rely on being able to set metadata as they please (they won't be aware of the attributes field) - do we think we could just do the mapping from AzureSearchDocumentMetadata.attributes to arbitrary metadata?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or what is the purpose of source here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about how the developer will use that feature, I used the attributes field to prevent developer need to specify the Schema of your Document (like python langchain implementation does) because Cognitive Search has no Dynamic field.

But the pattern used on all others vector stores is not have that. Your suggestion about create a map é good, I will try to implement that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or what is the purpose of source here?

In my case the field 'source' is required because is the description about where the document is from. I'll remove this constraint.

content_vector: vectors[idx],
metadata: {
source: doc.metadata?.source,
attributes: doc.metadata?.attributes ?? [],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above

const searchType = this.params.search.type;
let results: [Document, number][] = [];

if (searchType === "similarity") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've tended to do this at the retriever level:

if (this.searchType === "mmr") {

But I'm ok with putting it here since a lot of these are quite specific.

Copy link
Collaborator

@jacoblee93 jacoblee93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm willing to merge this as is but see comments

@jacoblee93 jacoblee93 self-assigned this Oct 23, 2023
@jacoblee93 jacoblee93 added lgtm PRs that are ready to be merged as-is question Further information is requested labels Oct 23, 2023
@izzymsft
Copy link
Contributor

I'm willing to merge this as is but see comments

Thanks @jacoblee93 we will work on modifying it as soon as possible. I appreciate the feedback. We will take a look at the other implementations to see how it is done.

@jacoblee93
Copy link
Collaborator

jacoblee93 commented Oct 24, 2023

Here's an example of one circumstance where a retriever would require arbitrary metadata for docs:

https://js.langchain.com/docs/modules/data_connection/retrievers/how_to/multi-vector-retriever#summary

I would foresee more coming as well as it's quite useful for some advanced retrieval techniques.

@jacoblee93 jacoblee93 added the hold On hold label Oct 24, 2023
@deejiw
Copy link

deejiw commented Nov 20, 2023

Any update so far? This feature would be a big leap for JS flavor.

@farzad528
Copy link

Hi team, I wanted to alert everyone that we have a new stable release that I would recommend you all use for this PR. Please note they are a couple breaking changes.

See sample: https://github.com/Azure/azure-search-vector-samples/blob/main/demo-javascript/JavaScriptVectorDemo/code/azure-search-vector-sample.js
See source code: https://github.com/Azure/azure-sdk-for-js/tree/main/sdk/search/search-documents
See REST API docs: https://learn.microsoft.com/en-us/javascript/api/%40azure/search-documents/?view=azure-node-latest
cc: @izzymsft @MatheusBordin

@MatheusBordin
Copy link
Contributor Author

MatheusBordin commented Dec 13, 2023

Hi guys, I was working on another stuffs last weeks and can't made any progress here, but I'm working on this right now and maybe in a few days I'll have updates for this implementation.

@farzad528 thank you for your warning, I already made the updates on the code for the new API and all are working fine (I need to push the code yet).

@sinedied
Copy link
Contributor

@MatheusBordin thanks for your work on this!

Could I provide some help on this? Whether it's helping with the code, documentation, or tests I can help.
I'm working at Microsoft and can get in touch with the SDK team if needed.

I noticed you're using an old beta of the Document Search SDK, and the 12.0.0 stable version has been released with some breaking changes. It would also be helpful I think to add in the documentation how you can use managed identity instead of keys, as it's part of security best practices.

Don't hesitate to ping me if you'd like some help 🙂

@sinedied
Copy link
Contributor

Hi @MatheusBordin, I forked your original branch (keeping your commits and credits) and completed the PR here: #4044

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features hold On hold lgtm PRs that are ready to be merged as-is question Further information is requested
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants