[MEVD] Filter-only search API (without vector similarity) #10295

Closed
roji opened this issue Jan 25, 2025 · 12 comments
Labels: Build, memory, msft.ext.vectordata, .NET, sk team issue

Comments

@roji
Member

roji commented Jan 25, 2025

We have seen requests for doing criteria filtering without vector similarity; that's important e.g. for synchronizing data, checking if records already exist in the database before updating them, and similar scenarios.

Once #10156 is done, doing this should be quite easy.

/cc @SteveSanderson

@roji roji added .NET Issue or Pull requests regarding .NET code memory msft.ext.vectordata Related to Microsoft.Extensions.VectorData labels Jan 25, 2025
@roji roji self-assigned this Jan 25, 2025
@github-actions github-actions bot changed the title .NET MEVD: filter-only search API (without vector similarity) .Net MEVD: filter-only search API (without vector similarity) Jan 25, 2025
@evchaki evchaki added the sk team issue A tag to denote issues that where created by the Semantic Kernel team (i.e., not the community) label Jan 31, 2025
@roji roji assigned adamsitnik and unassigned roji Feb 6, 2025
@roji
Member Author

roji commented Feb 6, 2025

We should first do a cross-database comparison to validate that most/all databases have this capability (I'd be very surprised if that's not the case).

Beyond that, I'm tentatively proposing the name QueryAsync() for the new API. The verb Query (as opposed to Search, SimilaritySearch...) seems to connect quite well with the idea of traditional, keyword-only SQL filtering (e.g. "SQL queries"). This is incidentally also the naming distinction that Milvus makes (Search vs. Query).

@markwallace-microsoft markwallace-microsoft moved this to Sprint: Planned in Semantic Kernel Feb 10, 2025
@roji roji changed the title .Net MEVD: filter-only search API (without vector similarity) [MEVD] Filter-only search API (without vector similarity) Mar 11, 2025
@roji roji added the Build Features planned for next Build conference label Mar 12, 2025
@adamsitnik
Member

I've done some quick research around the capabilities of the Vector DBs (for relational and NoSQL I assumed it simply has to be supported)

Short summary:

  • all seem to support the filter-only search
  • Pinecone is the only one that does not support sorting
  • Some require Top to be always provided, most use very low default values. So we need to make the top property required.
  • Qdrant is the only one that does not support Skip in a way that would be useful to us (we can specify offset, but it has to be the Id of the record to start with).

So I am going to start with an API that takes mandatory filter and Top property.

| Connector | Non-vector search | OrderBy | Top | Skip | API used | .NET code |
|---|---|---|---|---|---|---|
| Azure AI Search | Yes | Yes | Optional, defaults to 50, max 1k | Optional, max is 100k | https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents#query-parameters | azure-sdk-for-net/sdk/search/Azure.Search.Documents/src/Options/SearchOptions.cs at main · Azure/azure-sdk-for-net |
| Qdrant | Yes | Yes | Default is 10. | Not available. It supports Offset, which is the Id of the last record from the previous batch | Filtering - Qdrant | qdrant-dotnet/src/Qdrant.Client/QdrantClient.cs at c1b3f1527eabbde1df984702d9b36d49c4fad151 · qdrant/qdrant-dotnet |
| Pinecone | Should work (the vector property is nullable, docs are not clear) | No | Required. | Not available, need to read more results and skip them manually | https://docs.pinecone.io/guides/data/understanding-metadata#metadata-query-language | pinecone-dotnet-client/src/Pinecone/Index/Requests/QueryRequest.cs at main · pinecone-io/pinecone-dotnet-client |
| Weaviate | Should work (it's not clear if providing a search operator is mandatory, and they all refer to vectors) | Yes | Optional, defaults to 10. | Optional | Object-level queries (Get) \| Weaviate | semantic-kernel/dotnet/src/Connectors/Connectors.Memory.Weaviate/HttpV2/WeaviateVectorSearchRequest.cs at main · microsoft/semantic-kernel |
| CosmosDbNoSql | Yes | Yes | Optional | Optional | | azure-cosmos-dotnet-v3/Microsoft.Azure.Cosmos/src/Query/v3Query/QueryDefinition.cs at master · Azure/azure-cosmos-dotnet-v3 |

@roji roji moved this from Sprint: Planned to Sprint: In Progress in Semantic Kernel Mar 21, 2025
@roji
Member Author

roji commented Mar 21, 2025

Thanks for this Adam!

all seem to support the filter-only search

Great, that validates the need to expose this in the abstraction.

Pinecone is the only one that does not support sorting

When you say sorting/ordering, do you mean that databases simply allow picking a metadata property to sort by? Or is there more advanced support with expressions or something? If it's the former, then we'd just have a property selector (just like we do for selecting the vector property).

In any case, if Pinecone is the only one to not support this, it sounds like it definitely makes sense having it on the abstraction. I'm guessing it should be optional? If not specified, do databases just return random/undetermined ordering, or error?

Some require Top to be always provided, most use very low default values. So we need to make the top property required.

Yep, see #10193 for the same thing on the regular search API; whatever we decide to do should be done on both these search types (and also on the hybrid search API) for consistency. I'm happy for you to take over #10193 if you want - the main thing is making sure everyone agrees about the right thing here (/cc @westey-m @dmytrostruk).

Qdrant is the only one that does not support Skip in a way that would be useful to us (we can specify offset, but it has to be the Id of the record to start with).

That's an interesting one, and actually connects to general database questions about pagination efficiency; implementing pagination ("skipping") by giving an ID/key is actually typically the more efficient way to do things (see these docs, this post).

I'm curious why they chose to actually expose a Skip property on the API, given that you should already be able to simply do the same thing in the regular filter (i.e. Filter = r => r.Id > something). If there's some advantage in using the special Skip API (and I'm guessing there is, otherwise it wouldn't be there), we could go fancy and identify r.Id > something inside the filter, and extract that out, passing something to the Skip property (this is not quite as trivial as it sounds, e.g. when the comparison is nested in an OR we can't do it).

Maybe open an issue for us to look at this later, and throw for now if Skip is provided?
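
To make the pagination point concrete, here is a minimal sketch in plain LINQ over an in-memory list. It only illustrates the difference between offset-based skipping and resuming from the last key seen; the `Hotel` record and both helper methods are hypothetical, and nothing here calls a real connector or the MEVD API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record type, used only for this illustration.
public sealed record Hotel(int Id, string Name);

public static class PaginationSketch
{
    // Offset pagination: the store has to walk past `skip` records
    // before it can return the requested page.
    public static List<Hotel> OffsetPage(IEnumerable<Hotel> source, int skip, int top) =>
        source.OrderBy(h => h.Id).Skip(skip).Take(top).ToList();

    // Keyset pagination: resume after the last key seen. The key comparison
    // is an ordinary filter condition (the r.Id > something case above).
    public static List<Hotel> KeysetPage(IEnumerable<Hotel> source, int? lastSeenId, int top) =>
        source.Where(h => lastSeenId is null || h.Id > lastSeenId)
              .OrderBy(h => h.Id)
              .Take(top)
              .ToList();
}
```

Calling `KeysetPage(hotels, null, 3)` and then `KeysetPage(hotels, firstPage[^1].Id, 3)` returns the same second page as `OffsetPage(hotels, 3, 3)`, as long as the Ids are unique and the ordering is by Id.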

@adamsitnik
Member

Update: The Vector property for Pinecone QueryAsync is nullable, but only because either a vector or a vector ID must be provided.

There is a search_records API that we could use instead, but it's not exposed by the .NET client. I've created an issue: pinecone-io/pinecone-dotnet-client#43

For now I am using QueryAsync with a fake vector (just a vector full of zeros).
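
A rough sketch of what such a placeholder vector could look like follows. The `PlaceholderVector.Create` helper is hypothetical and is not taken from the connector; how the vector is then passed to the Pinecone client is left out on purpose. Whether a given index metric accepts an all-zero vector is an assumption that would need to be verified.

```csharp
using System;

public static class PlaceholderVector
{
    // Hypothetical helper: returns a vector of `dimensions` zeros to pass
    // where the API insists on a vector even though only the filter matters.
    public static ReadOnlyMemory<float> Create(int dimensions)
    {
        if (dimensions <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(dimensions));
        }

        // A new float[] is zero-initialized; float[] converts implicitly
        // to ReadOnlyMemory<float>.
        return new float[dimensions];
    }
}
```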

@adamsitnik
Member

If not specified, do databases just return random/undetermined ordering, or error?

I don't know yet. I am just afraid that if we don't expose OrderBy(propertySelector), but do expose Skip and Top, users might end up getting the same record in multiple batches (as we don't get vector score and the results are not ordered by the score)

@roji
Member Author

roji commented Mar 21, 2025

I am just afraid that if we don't expose OrderBy(propertySelector), but do expose Skip and Top, users might end up getting the same record in multiple batches (as we don't get vector score and the results are not ordered by the score)

Yep, that is how SQL works as well - you can specify LIMIT/OFFSET without ORDER BY, and get non-deterministic results. FWIW LIMIT can make sense without ORDER BY ("give me any 10 elements") but OFFSET not so much. EF warns for (most) such queries.

I don't think we should worry too much about this... Since almost all databases seem to support OrderBy, it seems obvious we should support that - if the user doesn't set it while using limit/offset, at the end of the day it's their problem...

@adamsitnik
Member

All the connectors except Pinecone support ordering the results. But unlike the other connectors, Redis and Qdrant allow sorting by only a single property.

For now, I've implemented a very simple approach where the user might provide a single OrderBy selector and a boolean flag to set the order to be descending.

public sealed class QueryOptions<TRecord>
{
    public Expression<Func<TRecord, object?>>? OrderBy { get; init; }
    public bool SortAscending { get; init; }
}

await fixture.Collection.QueryAsync(new()
{
    Filter = filter,
    Top = top,
    OrderBy = r => r.Int // selecting a single property
}).ToListAsync();
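
For the single-selector approach, a connector would need to turn the `OrderBy` expression into a property name. The following is a minimal sketch of one way to do that with `System.Linq.Expressions`; `OrderBySelector.GetPropertyName` and `SampleRecord` are hypothetical names for illustration, not the code from the actual implementation.

```csharp
#nullable enable
using System;
using System.Linq.Expressions;

// Hypothetical record type, used only for this illustration.
public sealed class SampleRecord
{
    public int Int { get; set; }
    public string String { get; set; } = "";
}

public static class OrderBySelector
{
    // Extracts the property name from a selector such as r => r.Int.
    public static string GetPropertyName<TRecord>(Expression<Func<TRecord, object?>> selector)
    {
        Expression body = selector.Body;

        // Value-type properties are boxed to object, so the compiler wraps
        // the member access in a Convert node that has to be unwrapped.
        if (body is UnaryExpression { NodeType: ExpressionType.Convert or ExpressionType.ConvertChecked } convert)
        {
            body = convert.Operand;
        }

        // Accept only a direct property or field access on the lambda parameter.
        if (body is MemberExpression { Expression: ParameterExpression } member)
        {
            return member.Member.Name;
        }

        throw new NotSupportedException("OrderBy must select a single property, e.g. r => r.Int.");
    }
}
```

With this sketch, `GetPropertyName<SampleRecord>(r => r.Int)` yields `"Int"`, while something like `r => r.Int + 1` throws.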

An alternative would be to expose a func expression that takes an IQueryable<TRecord> and returns an IOrderedQueryable<TRecord>, allowing users to chain OrderBy[Descending] and ThenBy[Descending] methods:

public sealed class QueryOptions<TRecord>
{
    public Expression<Func<IQueryable<TRecord>, IOrderedQueryable<TRecord>>>? OrderByExpression { get; init; }
}

await fixture.Collection.QueryAsync(new()
{
    Filter = filter,
    Top = top,
    OrderByExpression = input => input.OrderBy(x => x.Int).ThenByDescending(x => x.String)
}).ToListAsync();
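
With the chained alternative, a connector would have to walk the expression tree to recover the list of sort keys. Below is a minimal sketch of how that could work, assuming only plain property selectors are supported; `OrderByChain.Parse` and `SampleRecord` are hypothetical names and this is not the implementation from the PR.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical record type, used only for this illustration.
public sealed class SampleRecord
{
    public int Int { get; set; }
    public string String { get; set; } = "";
}

public static class OrderByChain
{
    // Returns (property name, ascending) pairs in the order the user wrote them.
    public static List<(string Property, bool Ascending)> Parse<TRecord>(
        Expression<Func<IQueryable<TRecord>, IOrderedQueryable<TRecord>>> orderBy)
    {
        var keys = new List<(string Property, bool Ascending)>();
        Expression current = orderBy.Body;

        // Walk from the outermost call (the last ThenBy) back toward the lambda parameter.
        while (current is MethodCallExpression call && call.Method.DeclaringType == typeof(Queryable))
        {
            bool ascending = call.Method.Name switch
            {
                nameof(Queryable.OrderBy) or nameof(Queryable.ThenBy) => true,
                nameof(Queryable.OrderByDescending) or nameof(Queryable.ThenByDescending) => false,
                _ => throw new NotSupportedException($"Unsupported method: {call.Method.Name}"),
            };

            // Arguments[1] is the key selector, wrapped in a Quote node.
            var lambda = (LambdaExpression)((UnaryExpression)call.Arguments[1]).Operand;
            if (lambda.Body is not MemberExpression { Expression: ParameterExpression } member)
            {
                throw new NotSupportedException("Each sort key must be a single property.");
            }

            keys.Add((member.Member.Name, ascending));
            current = call.Arguments[0];
        }

        // The chain must start directly at the IQueryable parameter.
        if (!ReferenceEquals(current, orderBy.Parameters[0]))
        {
            throw new NotSupportedException("Unsupported ordering expression.");
        }

        keys.Reverse(); // collected outermost-first, so flip to the user's order
        return keys;
    }
}
```

For `input => input.OrderBy(x => x.Int).ThenByDescending(x => x.String)` this produces `("Int", true)` followed by `("String", false)`. A connector for a database that supports only one sort property (Redis and Qdrant, per the comment above) could throw when more than one key comes back.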

Links to docs:

@roji let's sync and chat about it

@roji
Member Author

roji commented Mar 26, 2025

@adamsitnik great work, thanks - let's definitely think about it together (@westey-m you'll probably be interested too).

BTW we also need to finalize the naming (QueryAsync) - IIRC @westey-m wasn't a fan; maybe let's do a meeting where we do a more holistic naming look; for the embedding generation integration changes, I plan on proposing replacing the existing APIs with SearchEmbeddingAsync (which accepts a raw embedding), vs. SearchAsync which would take a non-embedding. My main objective here is to make sure unsuspecting new users get to SearchAsync by default, and not to one of the more advanced/niche APIs.

On that note, @adamsitnik can you put together a quick overview of what the non-vector search APIs are called in the different databases? If there's a standard/tendency that could inform our naming decision here and help us decide...

@adamsitnik
Member

can you put together a quick overview of what the non-vector search APIs are called in the different databases

The names of the APIs used are:

  • Azure AI Search: SearchAsync
  • MongoDB: FindAsync
  • CosmosDB: GetItemQueryIterator
  • Pinecone: search_records
  • Qdrant: ScrollAsync
  • Redis: SearchAsync
  • Weaviate: search

But most of them can be used for vector search as well (it's just a matter of different input for the same method)

@roji
Member Author

roji commented Mar 27, 2025

@adamsitnik one more thought... Do databases - and especially those that support only a single ordering property - generally allow specifying ascending vs. descending? Or is that more a thing that the databases with multiple properties support?

@adamsitnik
Member

adamsitnik commented Mar 27, 2025

@adamsitnik one more thought... Do databases - and especially those that support only a single ordering property - generally allow specifying ascending vs. descending? Or is that more a thing that the databases with multiple properties support?

They all (except Pinecone, which does not support ordering at all) support specifying ASC/DESC.

@adamsitnik
Member

Fixed by #11112

@roji roji moved this from Sprint: In Progress to Sprint: Done in Semantic Kernel Apr 9, 2025