URI Parser design change proposal for performance and stability #3193

# OData URI Parser design change for higher performance and reliability

This is a proposal to rewrite the OData URI parser stack from the ground up based on a different architecture. The primary motivation is performance, but the proposal also aims to improve usability and the developer experience.

## Motivations

### Performance

This proposal is part of a larger effort to turn OData into a high-performance library. The URI parser is a key component of the OData experience, since parsing the request URL is the first step in handling an OData request. We have made efforts aimed at improving performance before, the most recent being the [work by `xuhg` to use `ReadOnlySpan<char>` to cut down string allocations](https://github.com/OData/odata.net/pull/3077). While such efforts have yielded improvements, we're limited by the current architecture and public-facing APIs. If we entertain a breaking change, we can redesign the parser architecture for better performance and lower resource use.

The aim of this effort is to reach at least a 5x improvement in performance, and a corresponding reduction in allocations, for the core URI parser on benchmarks that exercise end-to-end semantic parsing of common URL patterns.

### Usability

The OData URI has two main parts: the path segments and the query options. There are different parsers for the path and for the query options. In general, each component has two parsing phases: a syntactic parse and a semantic parse. The syntactic parse is supposed to return a tree or sequence of nodes that represents the structure without binding model information. The semantic pass adds type information and validation from the model. However, the separation between the syntactic and semantic parse isn't cleanly implemented in all cases:

- The `SelectExpandSyntacticParser`, which performs syntactic parsing of the `$select` and `$expand` query options, is internal, meaning that end users cannot access it. It also has a dependency on `IEdmStructuredType`, meaning the model is required to use it. Customers have expressed interest in parsing URLs without the model.
- The path parser which parses the path segments simply splits the string on `/` slash characters. As a result, it doesn't properly handle cases where keys in the path may contain slashes. [A comment in the code](https://github.com/OData/odata.net/blob/main/src/Microsoft.OData.Core/UriParser/Parsers/UriPathParser.cs#L55) suggests that this is a known bug.
- The `UriQueryOptionsParser`, which does a lexical parse of some query options like `$filter`, reads sequences delimited by `(` and `)` as a single string. This means that items of literal collections cannot be individually consumed in the syntactic phase. They are properly parsed later in the semantic binder (for example, `FilterBinder`), where a model is required. This binding is also not ideal. For example, when parsing an `in` or `has` expression, the collection [literal is parsed by replacing the delimiting `(` and `)` with `[` and `]` and then calling the JsonReader to parse the array](https://github.com/OData/odata.net/blob/main/src/Microsoft.OData.Core/UriParser/Binders/InBinder.cs#L119). This "hack" is the cause of several bugs and incompatibilities related to the `in` operator, such as:
  - https://github.com/OData/odata.net/pull/3190
  - https://github.com/OData/odata.net/issues/2875
  - https://github.com/OData/odata.net/issues/3098
  - https://github.com/OData/odata.net/pull/2711
  - [Source comment acknowledging the flaws of this approach](https://github.com/OData/odata.net/blob/main/src/Microsoft.OData.Core/Uri/ODataUriConversionUtils.cs#L677).

The rewrite provides an opportunity to parse each expression properly, from the lexical tokens up to the semantic nodes.

## High-level architectural changes

The key changes proposed:

- Avoid premature string allocations. Reference segments in the source string using `ReadOnlySpan<char>` until the user explicitly asks for a `string`.
- Store parsed nodes in a flat array of structs instead of a graph of objects.
- Frugal use of space: for example, pack data into bit flags where applicable instead of separate `bool` fields. This helps keep the structs small and less expensive to move around.

Using an array of node structs is a common approach in memory-efficient parsers like `System.Text.Json`.

### Impact on customer-facing APIs and usability

A key side effect of this approach is that user-facing nodes will not be class instances but structs. This means no inheritance: the developer will not rely on the `is` operator or on visitors based on polymorphism to process nodes. They will rely on enums to distinguish between different node kinds and use methods such as `TryGetString()`, `TryGetInt()`, or `EnumerateChildren()` to extract content from the nodes. This is similar to the `JsonDocument` and `JsonElement` APIs in `System.Text.Json`. We may consider adding a higher-level layer of class-based APIs if they provide significant value.

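For comparison, this is the interaction style `System.Text.Json` already uses today, where `JsonElement` is a struct discriminated by `ValueKind` rather than by subclassing:

```csharp
using System.Text.Json;

using JsonDocument doc = JsonDocument.Parse("{\"price\": 100, \"name\": \"odata\"}");
JsonElement root = doc.RootElement;

// No polymorphism: callers branch on an enum and use typed accessors.
JsonElement price = root.GetProperty("price");
if (price.ValueKind == JsonValueKind.Number && price.TryGetInt32(out int value))
{
    Console.WriteLine(value); // prints 100
}
```

The proposed `QueryNode` APIs would follow the same pattern, with node kinds and typed accessors in place of `ValueKind` and `TryGetInt32`.
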
## Proof-of-concept and experiments

[This repository](https://github.com/habbes/experiments/tree/master/ODataSlimUrlParserConcept) contains the source code, benchmarks and experiments used in the proof-of-concept to demonstrate the feasibility of this proposal.

## Components

This section describes the proposal in more detail for each of the key components of the parser. The focus is on the query options parser, and the `$filter` query option in particular, because of its complexity relative to other query options like `$select` and `$expand`. We'll also mention the path parser for completeness.

### Query options parser

The `ODataQueryOptionParser` class is responsible for parsing the various query options in an OData URL. `ODataQueryOptionParser` is not a single parser, but a wrapper around different parsers for the individual query options. This proposal focuses on the `$filter` query option due to the relative complexity of its syntax compared to other query options like `$select` or `$expand`. Furthermore, there are similarities between parsing `$filter` and other query options like `$apply` and `$orderby`.

Parsing a `$filter` expression happens in three phases, in a pipeline where the output of one phase is the input of the next. The following table summarizes the phases and the components that implement them:

| Phase | Component | Description |
|-------|-----------|-------------|
| Tokenization | `ExpressionLexer` | Generates a sequence of tokens from the input string. At this phase we can distinguish between a keyword, a literal and an identifier. We can detect some syntax errors like unbalanced parentheses or unexpected end of input. |
| Abstract syntax tree | `UriQueryExpressionParser` | Generates a tree that represents the query expression. Each node in the tree is represented by a subclass of the `QueryToken` class. At this stage we can tell the structure of an `or` expression and correctly order the operands based on precedence. We don't know the types of expressions. We can detect a function call and its parameters, but we can't tell its type or whether it exists. |
| Semantic binding | `FilterBinder` | Attaches type information from the model to nodes in the AST. The nodes in the "semantic tree" are subclasses of `QueryNode`. At this stage we can tell the return type of a function or property access. We can tell whether a property exists or not. |

I propose to keep the same phases, but change how data is stored in memory for better efficiency, and ensure that each phase is "complete" (for example, tokenization of collection elements should not be deferred to the semantic binding stage).

#### Allocation-free lexer

The current `ExpressionLexer` is an internal class that keeps a reference to the input string, with the following key methods and properties:

- `NextToken()` reads the next token and advances the lexer to it.
- `TryPeekNextToken()` reads the next token but does not advance to it. This supports looking ahead when that's necessary to resolve some ambiguity.
- `CurrentToken` returns the current token as an `ExpressionToken`, a `struct` representing the token. We use `ReadOnlyMemory` to reference the token text without allocating.

How the lexer is used: iteratively call `NextToken()` and check `CurrentToken` until you reach the end of input.

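To make that iteration pattern concrete, a consuming loop might look roughly like this. This is a sketch against the proposed API: the `bool`-returning `NextToken()`, the `Token.Kind` member, and the `Start`/`End` fields on `ValueRange` are assumptions, not a finalized signature.

```csharp
// Sketch: iterating the proposed allocation-free lexer.
// ExpressionLexer is a ref struct over a ReadOnlySpan<char>,
// so this loop runs entirely on the stack.
ReadOnlySpan<char> input = "price gt 100".AsSpan();
var lexer = new ExpressionLexer(input);

while (lexer.NextToken()) // assumed to return false at end of input
{
    Token token = lexer.CurrentToken;
    // token.Range locates the token's text within the input;
    // no substring is allocated unless the caller asks for one.
    ReadOnlySpan<char> text = input[token.Range.Start..token.Range.End];
    Console.WriteLine($"{token.Kind}: {text.ToString()}");
}
```
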
The proposed lexer is broadly similar to the existing one, with the following key differences:

- The lexer is a `ref struct` instead of a class. It is a `ref struct` rather than a plain `struct` because it has a `ReadOnlySpan<char>` field. This means it can only be used on the stack.
- The lexer holds a `ReadOnlySpan<char>` reference to the input text.
- The `CurrentToken` property returns a `Token` struct.
- The token struct has a `Range` property of type `ValueRange`. `ValueRange` is a struct that contains the start and end positions of the token within the input, rather than the token text itself.
- The lexer will read tokens from within parenthesized expressions like collections. This will address a lot of the errors we currently face due to our `in` implementation.

This lexer does not allocate any objects on the heap, as demonstrated by the following [benchmark](https://github.com/habbes/experiments/blob/master/ODataSlimUrlParserConcept/Benchmarks/ParserBenchmarks.cs#L102), which tokenizes the filter expression `"category eq 'electronics' or price gt 100"`.

```md
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2314)
Intel Xeon W-2123 CPU 3.60GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.100
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
```

| Method | Mean | Error | StdDev | Gen0 | Allocated |
|--------|-----:|------:|-------:|-----:|----------:|
| **ParseExpression_SlimExpressionLexer** | 90.21 ns | 1.688 ns | 1.579 ns | - | - |

See a sample implementation here: https://github.com/habbes/experiments/blob/master/ODataSlimUrlParserConcept/Lib/ExpressionLexer.cs

See sample usage here: https://github.com/habbes/experiments/blob/master/ODataSlimUrlParserConcept/SlimParserTests/ExpressionLexerTests.cs

#### Expression parsers and AST generation

The current `UriQueryExpressionParser` recursively parses a sequence of tokens into a tree based on the `QueryToken` class.

- Different subclasses of `QueryToken` represent different types of nodes.
- We allocate substrings of the input to store text values in the `QueryToken`s.
- Binary operator nodes hold references to their left and right child `QueryToken`s.
- The `LiteralToken` is used to represent, among other things, parenthesized expressions (e.g. collection expressions). The expression is still in unparsed string form.

This is where we propose a drastic change. Instead of nodes with direct references to each other, we store nodes in a flat array.

A node would have roughly the following structure:

```csharp
internal struct ExpressionNode
{
    public ExpressionNodeKind Kind { get; internal set; }
    public ValueRange Range { get; internal set; }
    internal int Count { get; set; }
    internal int LastChild { get; set; }
    public bool IsTerminal => Count == 0;
}
```

And the expression parser will look roughly as follows:

```csharp
public class ExpressionParser
{
    internal readonly ReadOnlyMemory<char> _source;
    internal NodeList<ExpressionNode> _nodes;

    private ExpressionParser(ReadOnlyMemory<char> source) => _source = source;

    public static QueryNode Parse(ReadOnlyMemory<char> source)
    {
        var parser = new ExpressionParser(source);
        var lexer = new ExpressionLexer(parser._source.Span);
        int root = ParseExpression(ref lexer, ref parser._nodes, 0);
        return new QueryNode(parser, root);
    }
}
```

The `ExpressionNode` struct is an internal representation of any kind of node in the "tree". This struct is not exposed publicly. Instead, we return a `QueryNode` struct, which has the following structure:

```csharp
public struct QueryNode
{
    private int _index;
    private ExpressionParser _parser;

    public ExpressionNodeKind Kind => _parser[_index].Kind;

    public int GetIntValue()
    {
        return int.Parse(_parser[_index].Range.GetValue(_parser.Source));
    }

    public string GetStringValue();
    public ReadOnlySpan<char> GetSpanValue();

    public int CollectionLength();
    public QueryNode GetItemAt(int index);
    public Enumerator EnumerateItems();
    public QueryNode GetLeft();
    public QueryNode GetRight();
}
```

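As a sketch of how a caller might walk this structure (hypothetical usage of the proposed API; the `ExpressionNodeKind` member names are illustrative):

```csharp
// Sketch: traversing the parsed expression "price gt 100 or discounted eq true".
// The caller switches on Kind instead of pattern-matching on node classes.
QueryNode root = ExpressionParser.Parse("price gt 100 or discounted eq true".AsMemory());

if (root.Kind == ExpressionNodeKind.Or)
{
    QueryNode left = root.GetLeft();   // the "price gt 100" subtree
    QueryNode right = root.GetRight(); // the "discounted eq true" subtree

    // Terminal nodes expose typed accessors instead of subclasses.
    QueryNode priceLiteral = left.GetRight();
    int value = priceLiteral.GetIntValue();
}
```
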
The public `QueryNode` struct is smaller than the internal `ExpressionNode` and would be cheaper to move around.

`ExpressionNode` can represent terminal nodes like primitives and identifiers, binary operators with a left and right child, and collections with an arbitrary number of children. It can also represent nodes with nested children.

- The `Count` property represents the number of top-level children a node has. `LastChild` is the index of the node's last child in the array.
- We use a `Count` of `0` to represent terminal nodes.
- The first child is always at the parent's index + 1.

For nodes whose children are not themselves nested, we can find out in O(1):

- whether the node has children, via the `Count` property
- the index of the first child (`index + 1`)
- the index of any child (since we know where each child item will be)

However, if children are nested (e.g. an `or` node that has a left `and` node), it's harder to tell where the children are located. In this case, we have to iterate through each child and skip over nested children to enumerate the top-level children. We can set the most significant bit of `Count` to 1 to indicate that a node has "complex" or "nested" children. If this bit is 0, we use the O(1) approach to find children by index; if the bit is 1 (i.e. `Count < 0`), we use the iterative approach.

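The sibling-skipping iteration can be sketched as follows. This is a hypothetical helper over the proposed node array; it assumes the `ExpressionNode` fields described above (`Count` of 0 for terminals, sign bit for nested children, `LastChild` pointing at a node's last top-level child).

```csharp
// Sketch: enumerate the indexes of the top-level children of nodes[parentIndex].
// Each child's subtree occupies a contiguous range of the array, so we can
// hop from one sibling to the next by skipping over the previous subtree.
internal static IEnumerable<int> EnumerateChildIndexes(ExpressionNode[] nodes, int parentIndex)
{
    ExpressionNode parent = nodes[parentIndex];
    int count = parent.Count & int.MaxValue; // clear the "nested children" bit
    int child = parentIndex + 1;             // the first child is always adjacent

    for (int i = 0; i < count; i++)
    {
        yield return child;

        // To find the next sibling, descend through LastChild until we hit a
        // terminal node; the hop count is proportional to the nesting depth.
        int end = child;
        while (!nodes[end].IsTerminal)
        {
            end = nodes[end].LastChild;
        }
        child = end + 1;
    }
}
```
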
Let's use the following query to demonstrate how the array will be laid out:

```text
category in ('electronics', substring(field, 1), 'technology')
```

Here's a tree representation of the query:

 | ||
|
||
**Note**: This is not how the current `UriQueryExpressionParser` represents this tree. Notably, it does not parse the collection operand of the `in` operator.

Here's how this would look in the node array:

 | ||
|
||
The downside of this approach is that there could be multiple nesting levels between two siblings of the same parent. That would require multiple hops to skip from one child to its next sibling; the number of hops is proportional to the level of nesting. If we have highly nested collections, this could noticeably slow down iteration or indexing beyond O(n). If we anticipate highly nested collections, we could reduce the number of hops to one per sibling by storing more metadata about each child and level (TODO: demonstrate this).

How does the end user process this tree? Since we're using structs, we can't provide a traditional visitor based on inheritance. The basic approach is for the user to simply traverse the tree using the APIs on the `QueryNode` struct, as they would any other DOM-like structure. But this may be tedious in some cases. For convenience, we can propose a handler interface that exposes a different method for each type of node, e.g.:

```csharp
interface IQueryNodeHandler<T>
{
    public T HandleIntNode(QueryNode intLiteralNode);
    public T HandleInNode(QueryNode inNode);
    public T HandleCollectionNode(QueryNode collection);
}
```

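A dispatcher pairing this handler with the `Kind` enum might look like this (a sketch; the `ExpressionNodeKind` members and the `Dispatch` helper are hypothetical):

```csharp
// Sketch: dispatching a QueryNode to the matching handler method.
// This replaces the polymorphic Accept/Visit pattern used with class hierarchies.
public static T Dispatch<T>(QueryNode node, IQueryNodeHandler<T> handler)
{
    return node.Kind switch
    {
        ExpressionNodeKind.IntLiteral => handler.HandleIntNode(node),
        ExpressionNodeKind.In => handler.HandleInNode(node),
        ExpressionNodeKind.Collection => handler.HandleCollectionNode(node),
        _ => throw new NotSupportedException($"Unhandled node kind: {node.Kind}"),
    };
}
```

One caveat of this shape is that the interface must grow a method whenever a new node kind is introduced, so versioning of the interface needs to be considered.
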
#### Semantic binding

In the semantic binding phase, we would start with a type that represents the `IEdmType` of the path, then traverse the expression starting from the root and bind semantic information such as the `IEdmType` to each node. This semantic data could be stored in a separate array of the same size as the expression node array, where each index contains the metadata that corresponds to the expression node at the same index.

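A minimal sketch of the parallel-array idea, assuming hypothetical `SemanticInfo` and `SemanticBinder` types (the field names are illustrative, not part of the proposal's API):

```csharp
// Sketch: semantic data stored in an array parallel to the syntax nodes.
// nodes[i] and semantics[i] describe the same expression node, so binding
// adds no pointers to the syntax nodes and allocates one array per parse.
internal struct SemanticInfo
{
    public IEdmTypeReference EdmType; // resolved type; null if unresolved
    public IEdmProperty Property;     // bound property for property-access nodes
}

internal sealed class SemanticBinder
{
    public SemanticInfo[] Bind(ExpressionNode[] nodes, IEdmStructuredType contextType)
    {
        var semantics = new SemanticInfo[nodes.Length];
        // Traverse from the root (index 0), resolving each node against the
        // model and writing its type information into the matching slot.
        // ... binding logic elided ...
        return semantics;
    }
}
```
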
### Path parser

The current path parser is implemented in two phases:

1. [The `UriPathParser`](https://github.com/OData/odata.net/blob/main/src/Microsoft.OData.Core/UriParser/Parsers/UriPathParser.cs) splits the path into an `ICollection<string>` of segments using `/` as the separator. It doesn't check for the presence of `/` characters inside string literals in key segments.
2. The internal [ODataPathParser](https://github.com/OData/odata.net/blob/main/src/Microsoft.OData.Core/UriParser/Parsers/ODataPathParser.cs) goes through each segment and binds a schema element and type to it. This step returns an `IList<ODataPathSegment>`, where `ODataPathSegment` is an abstract class. Each type of segment is a subclass of `ODataPathSegment`, for example `EntitySetSegment`, `EntityIdSegment`, etc.

Here are the proposed changes:

1. The lexer should not split the string. It should be an iterator that successively yields the next segment token each time `Read()` is called. Each token is a span into the input string. There should be no heap allocations in this phase. We can also properly handle slash characters inside string literals.
2. The semantic parser progressively advances the tokenizer, creating a semantic node for each segment. The semantic node would have roughly the following shape:

```csharp
internal readonly struct PathSegmentNode
{
    public PathSegmentKind Kind { get; }

    // Index of the last child, for example if this is a function with args or a compound key.
    // Since paths don't have nested children, we can easily compute any child's index
    // from the current index and the index of the last child.
    public int LastChild { get; }

    public ValueRange Range { get; }

    public IEdmElement EdmElement { get; } // Should we make edm-element lookup lazy?
}
```

These nodes don't hold a direct reference to the string; the range just contains the start and end offsets, but doesn't point to the actual string. This makes the nodes flexible to use with either `ReadOnlySpan<char>` or `ReadOnlyMemory<char>`. The `PathSegmentNode` is not exposed to the user: it needs to be linked to the parent list to be usable, and it doesn't have enough context to be used as a standalone node. The struct is also relatively large, and we don't want to incur the cost of copying it directly.

The parsed path would look like:

```csharp
public sealed class PathSegmentList : IReadOnlyList<PathSegment>
{
    private PathSegmentNode[] _nodes;

    public PathSegment this[int index] => new PathSegment(this, index);
}
```

The `PathSegment` struct is the public-facing type representing a path segment. It references the index of the `PathSegmentNode` in the list and exposes methods to extract data from the node based on its kind.

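A hypothetical consumption example (the `PathParser.Parse` entry point, `PathSegmentKind` members, and `GetStringValue` accessor are illustrative assumptions, not part of the proposal text):

```csharp
// Sketch: consuming the parsed path "customers('a/b')/orders".
// PathSegmentList hands out lightweight PathSegment structs that index
// into the internal PathSegmentNode array.
PathSegmentList path = PathParser.Parse("customers('a/b')/orders", model);

foreach (PathSegment segment in path)
{
    switch (segment.Kind)
    {
        case PathSegmentKind.EntitySet:
            Console.WriteLine($"Entity set: {segment.GetStringValue()}");
            break;
        case PathSegmentKind.Key:
            // The slash inside the key literal is handled correctly because
            // the proposed lexer understands string literals.
            Console.WriteLine($"Key: {segment.GetStringValue()}");
            break;
    }
}
```
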
---

**Reviewer comment:** It was brought up in the meeting that we have the whole URI string in memory, so we should leverage that for performance gains instead of trying to overly generalize. Have you considered what it looks like for someone to compose an OData URI? For example, in the OData Client, we compose the URI as the caller tells us what data they are requesting. In that situation, we actually don't have the full URI anywhere in memory. Are we thinking we will have an entirely separate AST model for the client code to leverage?

**Author response:** The client can either consume or produce an OData URI. When it consumes an OData URI, the URI is available in full, in memory. When it produces the URI, for example when translating LINQ expressions to an OData request, it generates the URI on the fly using a `StringBuilder`; the output of the translation will be a string that's in memory, but even if it weren't, this URI generation is not the role of the parser. The parser as described in this doc (and as implemented by the existing `ODataUriParser`) takes the URI as input, it does not generate it.

**Reviewer reply:** I guess my point is about the contract that the customer interfaces with. For OData Client, the customer gets a `string` output, and for the parser, the customer has a `string` input, so the "contract" that we are publishing to customers is that an OData URI is a `string`. I'll try to illustrate with something a bit contrived but that I think is reasonable: a customer is using OData Client to generate a request. They then realize that they want to add an additional query option that the client doesn't support yet. So, they get the URI string from the client. At this point, they can't interact with our strongly-typed URI code. If they parse the string, they will end up with types that can't be used to produce a new URI. So, the customer now has to either implement their own minimal parser, or do simple string manipulation that is very error prone. If instead the parser produced a type that can be added to (even something similar to how immutable collections work), then the customer could take the string from the client, parse it into something strongly typed, make their adjustments, and get a string back by walking the "AST". But if all of this is possible, then there's really no reason that the client can't just give the customer the same type that the parser produces. If we do this, the "contract" that our customers see is "ah, this AST-looking thing is the type of OData URIs".