-
Notifications
You must be signed in to change notification settings - Fork 2.6k
feat: add Azure Search as VectorStore provider #2396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
fields: [DEFAULT_FIELD_CONTENT_VECTOR], | ||
kNearestNeighborsCount: k, | ||
}], | ||
filter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like the other search methods, needs to have a top: k
field passed in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When search using vector similarity the parameter to limit the results is kNearestNeighborsCount
.
const indexClient = new SearchIndexClient(endpoint, credential); | ||
|
||
try { | ||
await indexClient.getIndex(indexName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that even getIndex requires an admin key and will fail with a query key. So you pass in a query key, exception gets thrown, after which the createIndex call will also fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the implementation requires an Admin Key, but if you want to use the query key you can also pass the SearchClient instead of indexName, doing that I'll not validate if index exists and all works.
That is the example:
AzureSearchStore.create({
client: new SearchClient(...),
search: {...}
});
I think is nice to have that in the documentation and some test's to validate that approach. Make sense?
Up! |
I'm working on unit and integrate tests and until weekend I'll make the push. |
Hi @MatheusBordin thanks for working on this. I see that you are already working on the tests. Please let me know if I can help or support in anyway. Thanks. |
Is this still being worked on? If not I can take over from where @MatheusBordin started. |
@dexteresc I had the same question last week. Let's give Matheus a few days to respond (maybe until October 19) and then maybe we can collaborate to take it to the finish line. Quite a lot of folks are looking forward to this feature being added to Langchain JS. It's almost done from what I can see so far. |
Hy guys @izzymsft @dexteresc, the last weeks was crazy here, I needed to focus and paused the development of this feature. I'm submitting all changes that I already developed and I think that we can work together to finish all. |
Thank you for sharing the update @MatheusBordin When you see this please let me know how I can assist. I can write tests, review the code and help with examples. You can grant me access to your repo as a collaborator or if you would prefer I can fork your repo and send you PRs to your own fork of the original langchainjs repo. |
Hey @izzymsft , I'm working right now on the remaining tests (Integration), after that I think we can work on the documentation. Yeah, I can add you as a collaborator, validate your access on the next 5 minutes please. |
Yes, I have accepted the invite and I was able to make an edit. Thanks. |
@izzymsft thanks for your help. I have pushed a commit with integration tests implemented, unfortunately they are not passing. Its seen to be a problem with my Open AI organization, the embedding api returns timeout. I'll try to fix it tomorrow. Anyway, this implementation is alredy been used for a couple of weeks in production and all works fine, fixing the implementation tests and writing the documentation I think we are ready to release. |
@MatheusBordin no problem I think for now you can switch to FakeEmbeddings (a mock embedding) so that the tests are not unstable. You can see how other integration tests in the same directory (Redis/Chroma) are doing it. I will attempt to run it with my Azure OpenAI settings to see if it is still failing tomorrow |
Status update: I pushed the remaining tests just right now. Just missing the documentation to publish the PR. Thanks @izzymsft for the idea of using FakeEmbeddings on integration tests, I found some issues about the mock implementation but I'll not fix that in this Pull Request because it can late the release. |
…to feat/azure-search
Status update: I write the documentation with a simple quick start that I think is enough to publish. For now I'm publishing the feature to review, lets wait... |
@MatheusBordin thanks for the update. I have made some changes to the documentation as well to add some links with more information on the vector search capabilities. Let's see if the langchain maintainers have any feedback on the PR. Thank you for all your efforts on taking this to the finish line. |
@MatheusBordin @izzymsft @glorat @leofmarciano thanks for the update!! |
Ah apologies, didn't realize this was ready! Will have a look now. |
@@ -887,6 +891,9 @@ | |||
"@aws-sdk/client-sagemaker-runtime": "^3.310.0", | |||
"@aws-sdk/client-sfn": "^3.310.0", | |||
"@aws-sdk/credential-provider-node": "^3.388.0", | |||
"@aws-sdk/protocol-http": "^3.374.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these AWS deps should be removed? Maybe a bad merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, looks like a bad merge. I will remove that
/** | ||
* Define metadata schema. | ||
* | ||
* If yout want to add custom data, use the attributes property. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small typo
*/ | ||
export type AzureSearchDocumentMetadata = { | ||
source: string; | ||
attributes?: Array<{ key: string; value: string; }>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this restriction as many advanced retrievers will rely on being able to set metadata as they please (they won't be aware of the attributes
field) - do we think we could just do the mapping from AzureSearchDocumentMetadata.attributes
to arbitrary metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or what is the purpose of source
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about how the developer will use that feature, I used the attributes field to prevent developer need to specify the Schema of your Document (like python langchain implementation does) because Cognitive Search has no Dynamic field.
But the pattern used on all others vector stores is not have that. Your suggestion about create a map é good, I will try to implement that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or what is the purpose of
source
here?
In my case the field 'source' is required because is the description about where the document is from. I'll remove this constraint.
content_vector: vectors[idx], | ||
metadata: { | ||
source: doc.metadata?.source, | ||
attributes: doc.metadata?.attributes ?? [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above
const searchType = this.params.search.type; | ||
let results: [Document, number][] = []; | ||
|
||
if (searchType === "similarity") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've tended to do this at the retriever level:
if (this.searchType === "mmr") { |
But I'm ok with putting it here since a lot of these are quite specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm willing to merge this as is but see comments
Thanks @jacoblee93 we will work on modifying it as soon as possible. I appreciate the feedback. We will take a look at the other implementations to see how it is done. |
Here's an example of one circumstance where a retriever would require arbitrary metadata for docs: I would foresee more coming as well as it's quite useful for some advanced retrieval techniques. |
Any update so far? This feature would be a big leap for JS flavor. |
Hi team, I wanted to alert everyone that we have a new stable release that I would recommend you all use for this PR. Please note they are a couple breaking changes. See sample: https://github.com/Azure/azure-search-vector-samples/blob/main/demo-javascript/JavaScriptVectorDemo/code/azure-search-vector-sample.js |
Hi guys, I was working on another stuffs last weeks and can't made any progress here, but I'm working on this right now and maybe in a few days I'll have updates for this implementation. @farzad528 thank you for your warning, I already made the updates on the code for the new API and all are working fine (I need to push the code yet). |
@MatheusBordin thanks for your work on this! Could I provide some help on this? Whether it's helping with the code, documentation, or tests I can help. I noticed you're using an old beta of the Document Search SDK, and the 12.0.0 stable version has been released with some breaking changes. It would also be helpful I think to add in the documentation how you can use managed identity instead of keys, as it's part of security best practices. Don't hesitate to ping me if you'd like some help 🙂 |
Hi @MatheusBordin, I forked your original branch (keeping your commits and credits) and completed the PR here: #4044 |
Add support for Azure Cognitive Search vector store based on python implementation with some improvement's:
TODO:
addDocuments
to perform document's. upload in batch.similarity
,hybrid
, ...)id
.filter
.