URI Parser design change proposal for performance and stability #3193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Draft

habbes wants to merge 5 commits into OData:main from habbes:uri-parser-rewrite-proposal

Contributor

habbes commented Mar 4, 2025

Issues

This pull request fixes #xxx.

Description

Briefly describe the changes of this pull request.

Checklist (Uncheck if it is not completed)

Test cases added
Build and test with one-click build and test script passed

Additional work necessary

If documentation update is needed, please add "Docs Needed" label to the issue and provide details about the required document change in the issue.

habbes added 5 commits

March 3, 2025 22:35


          Start URI parser rewrite proposal

3ff28c5


          Add notes on allocation-free parser

ea05d53


          AST generation

3b836a8


          Update title

16a5f7b


          Semantic binding

eb47505

habbes changed the title ~~URI Parser rewrite for performance and stability~~ URI Parser design change proposal for performance and stability

corranrogue9 reviewed

View reviewed changes

docs/uri-parser-rewriter-proposal.md Show resolved Hide resolved

corranrogue9 reviewed

View reviewed changes

docs/uri-parser-rewriter-proposal.md

+              ```csharp
+              interface IQueryNodeHandler<T>
+              {
+                  public T HandleIntNode(QueryNode intLiteralNode);

Contributor

corranrogue9 Mar 4, 2025

Is there a huge perf overhead for these methods to take in more specific types? Like, it would be great if we could make this IntQueryNode instead

Contributor

corranrogue9 Mar 4, 2025

Also, we have to be very careful about this because we will need to constantly add support for async, unsafe, methods that require parameterized context, allows ref struct, etc.

Contributor Author

habbes Mar 13, 2025

Based on your suggestions, I will investigate the cost of adding more specific types on top QueryNode for the different methods, e.g. Handle(IntQueryNode)

Where IntQueryNode could be defined as

struct IntQueryNode
{
     private QueryNode _node;
     public int GetIntValue() => _node.GetIntValue();
}

Maybe it could be optimized to a ref struct itself or it could reference the expression parser directly if such optimizations prove meaningful and practical.

corranrogue9 reviewed

View reviewed changes

docs/uri-parser-rewriter-proposal.md Show resolved Hide resolved

corranrogue9 reviewed

View reviewed changes

docs/uri-parser-rewriter-proposal.md

+              The key changes proposed:
+              - Avoid premature string allocations. Reference segments in the source string using `ReadOnlySpan<char>` until the user explicitly asks for a `string`.
+              - Store parsed nodes in array of structs instead of graph of objects.

Contributor

corranrogue9 Mar 4, 2025 •

edited

Loading

I want to be clear about this aspect and the terminology that we are using. By not having a tree, this is more akin to the readers that we have. This is not really "parsing" anymore. There's nothing wrong with this, I'm more having an issue with the terminology, and the terminology is important to me because (as you note later in the document), we don't differentiate between the different kinds of notes (in the same way that the current reader doesn't): the caller is expected to understand what kinds of nodes can come next and needs to be aware of this based on the context of the previous nodes.

corranrogue9 reviewed

View reviewed changes

docs/uri-parser-rewriter-proposal.md

+              ```csharp
+              internal struct ExpressionNode
+              {
+                  public ExpressionNodeKind Kind { get; internal set; }

Contributor

corranrogue9 Mar 4, 2025

If we use classes to model the enumeration instead of an enum, we could have a visitor on those classes which would help implement the code that generates the ref struct nodes that could be used in the handler that I suggested below

Contributor Author

habbes Mar 13, 2025 •

edited

Loading

We could compare both and see what the perf overhead is for each. But I assume the enum would have less overhead, it's smaller value, which could make it easier to pack the struct more densely and we don't have to cast.

Another I opted for an enum is that we could pack additional data in the enum. For example we could have a bit flag that tells us whether the item is a binary operator, another to tell us whether the string needs escaping or whether it could be returned as is (in which case we could return a span of the string to the user without allocating), etc. If we have an int enum we could use the lower 16 bits for the value and higher 16 bits for flags (we could also go for a smaller enum type depending on how many node kinds we'll have and whether alignment would make it worthwhile). Such flags could help optimize the common case. But I haven't yet done benchmarks to prove their worth.

Contributor

corranrogue9 Mar 19, 2025

We could compare both and see what the perf overhead is for each. But I assume the enum would have less overhead, it's smaller value, which could make it easier to pack the struct more densely and we don't have to cast.

The use of the visitor is actually what stops the need to cast, but your overall point certainly stands: this will be slower than an enum. I suppose I'm suggesting that between this change and the change for ihandler to take in more specific types, we will lose out slightly on performance but get 90% of the type safety benefits of a completely strongly-typed AST (for example, kind will now be expressed in a way that ensure we notice backwards compatibility issues, and ihandler would be expressed in a way that ensures callers don't have to do their own assertions). I understand if you don't take either suggestion, but they seems a very reasonable middle ground to me.

Another I opted for an enum is that we could pack additional data in the enum....

I don't know enough about it and can't get a good test to repro the perf benefits. Is there a benefit to doing this instead of using 2 short enums?

corranrogue9 reviewed

View reviewed changes

docs/uri-parser-rewriter-proposal.md

		@@ -0,0 +1,272 @@
		# OData URI Parser design change for higher performance and reliability

Contributor

corranrogue9 Mar 4, 2025

It was brought up in the meeting that we have the whole URI string in memory, so we should leverage that for performance gains instead of trying to overly generalize. Have you considered what it looks like for someone to compose an OData URI? For example, in the OData Client, we compose the URI as the caller tells us what data they are requesting. In that situation, we actually don't have the full URI anywhere in memory. Are we thinking we will have an entirely separate AST model for the client code to leverage?

Contributor Author

habbes Mar 13, 2025 •

edited

Loading

The client can either consume or produce OData URI. When it consumes OData URI, the URI is available in full, in memory. When it produces the URI, for example when translating LINQ expressions to an OData request, it generates the URI on the fly using a StringBuilder, the output of the translation will be string that's in memory, but even if it wasn't, this URI generation is not the role of the parser. The parser as described in this doc (and as implemented by the existing ODataUriParser) takes the URI as input, it does not generate it.

Contributor

corranrogue9 Mar 19, 2025

I guess my point is about the contract that the customer interfaces with. For OData Client, the customer gets a string output, and from the parser, the customer has a string input, so the "contract" that we are publishing to customers is that an OData URI is a string. I'll try to illustrate with something a bit contrived but that I think is reasonable:

A customer is using OData Client to generate a request. They then realize that they want to add an additional query option that the client doesn't support yet. So, they get the URI string from the client. At this point, they can't interact with our strongly-type URI code. If they parse the string, they will end up with types that can't be used to produce a new URI. So, the customer now has to either implement their own minimalized parser, or they will do simple string manipulation that is very error prone.

If instead the parser produced a type that can be added to (even something similar to how immutable collections work), then the customer could take the string from the client and parse it into something strongly typed, make their adjustments, and get a string back by walking the "AST". But if all of this is possible, then there's really no reason that the client can't just give the customer the same type that the parser produces. If we do this, the "contract" that our customers see is "ah, this AST-looking this is the type of OData URIs".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet