-
Notifications
You must be signed in to change notification settings - Fork 359
URI Parser design change proposal for performance and stability #3193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
```csharp | ||
interface IQueryNodeHandler<T> | ||
{ | ||
public T HandleIntNode(QueryNode intLiteralNode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a huge perf overhead for these methods to take in more specific types? Like, it would be great if we could make this IntQueryNode
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we have to be very careful about this because we will need to constantly add support for async
, unsafe
, methods that require parameterized context, allows ref struct
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on your suggestions, I will investigate the cost of adding more specific types on top QueryNode
for the different methods, e.g. Handle(IntQueryNode)
Where IntQueryNode
could be defined as
struct IntQueryNode
{
private QueryNode _node;
public int GetIntValue() => _node.GetIntValue();
}
Maybe it could be optimized to a ref struct itself or it could reference the expression parser directly if such optimizations prove meaningful and practical.
The key changes proposed: | ||
|
||
- Avoid premature string allocations. Reference segments in the source string using `ReadOnlySpan<char>` until the user explicitly asks for a `string`. | ||
- Store parsed nodes in array of structs instead of graph of objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to be clear about this aspect and the terminology that we are using. By not having a tree, this is more akin to the readers that we have. This is not really "parsing" anymore. There's nothing wrong with this, I'm more having an issue with the terminology, and the terminology is important to me because (as you note later in the document), we don't differentiate between the different kinds of notes (in the same way that the current reader doesn't): the caller is expected to understand what kinds of nodes can come next and needs to be aware of this based on the context of the previous nodes.
```csharp | ||
internal struct ExpressionNode | ||
{ | ||
public ExpressionNodeKind Kind { get; internal set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use class
es to model the enumeration instead of an enum
, we could have a visitor on those classes which would help implement the code that generates the ref struct
nodes that could be used in the handler that I suggested below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could compare both and see what the perf overhead is for each. But I assume the enum would have less overhead, it's smaller value, which could make it easier to pack the struct more densely and we don't have to cast.
Another I opted for an enum is that we could pack additional data in the enum. For example we could have a bit flag that tells us whether the item is a binary operator, another to tell us whether the string needs escaping or whether it could be returned as is (in which case we could return a span of the string to the user without allocating), etc. If we have an int enum we could use the lower 16 bits for the value and higher 16 bits for flags (we could also go for a smaller enum type depending on how many node kinds we'll have and whether alignment would make it worthwhile). Such flags could help optimize the common case. But I haven't yet done benchmarks to prove their worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could compare both and see what the perf overhead is for each. But I assume the enum would have less overhead, it's smaller value, which could make it easier to pack the struct more densely and we don't have to cast.
The use of the visitor is actually what stops the need to cast, but your overall point certainly stands: this will be slower than an enum. I suppose I'm suggesting that between this change and the change for ihandler
to take in more specific types, we will lose out slightly on performance but get 90% of the type safety benefits of a completely strongly-typed AST (for example, kind
will now be expressed in a way that ensure we notice backwards compatibility issues, and ihandler
would be expressed in a way that ensures callers don't have to do their own assertions). I understand if you don't take either suggestion, but they seems a very reasonable middle ground to me.
Another I opted for an enum is that we could pack additional data in the enum....
I don't know enough about it and can't get a good test to repro the perf benefits. Is there a benefit to doing this instead of using 2 short
enums?
@@ -0,0 +1,272 @@ | |||
# OData URI Parser design change for higher performance and reliability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was brought up in the meeting that we have the whole URI string in memory, so we should leverage that for performance gains instead of trying to overly generalize. Have you considered what it looks like for someone to compose an OData URI? For example, in the OData Client, we compose the URI as the caller tells us what data they are requesting. In that situation, we actually don't have the full URI anywhere in memory. Are we thinking we will have an entirely separate AST model for the client code to leverage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client can either consume or produce OData URI. When it consumes OData URI, the URI is available in full, in memory. When it produces the URI, for example when translating LINQ expressions to an OData request, it generates the URI on the fly using a StringBuilder, the output of the translation will be string that's in memory, but even if it wasn't, this URI generation is not the role of the parser. The parser as described in this doc (and as implemented by the existing ODataUriParser
) takes the URI as input, it does not generate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess my point is about the contract that the customer interfaces with. For OData Client, the customer gets a string
output, and from the parser, the customer has a string
input, so the "contract" that we are publishing to customers is that an OData URI is a string
. I'll try to illustrate with something a bit contrived but that I think is reasonable:
A customer is using OData Client to generate a request. They then realize that they want to add an additional query option that the client doesn't support yet. So, they get the URI string from the client. At this point, they can't interact with our strongly-type URI code. If they parse the string, they will end up with types that can't be used to produce a new URI. So, the customer now has to either implement their own minimalized parser, or they will do simple string manipulation that is very error prone.
If instead the parser produced a type that can be added to (even something similar to how immutable collections work), then the customer could take the string from the client and parse it into something strongly typed, make their adjustments, and get a string back by walking the "AST". But if all of this is possible, then there's really no reason that the client can't just give the customer the same type that the parser produces. If we do this, the "contract" that our customers see is "ah, this AST-looking this is the type of OData URIs".
Issues
This pull request fixes #xxx.
Description
Briefly describe the changes of this pull request.
Checklist (Uncheck if it is not completed)
Additional work necessary
If documentation update is needed, please add "Docs Needed" label to the issue and provide details about the required document change in the issue.