Improve RegExp tokenization in Adblock syntax to ensure consistency and resolve escaping issues #121
`$` symbol in the value of some special modifiers
Our parsing logic fundamentally needs to handle three key aspects:
Tokenization Strategy
I suggest breaking any input string into the following tokens:
For cosmetic rules, we need to extend tokenization with the following additional tokens:
Example: Token Processing for Network Rules
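A minimal sketch of what token processing for a network rule could look like. The token kinds (`Dollar`, `Comma`, `Equals`, `Slash`, `Pipe`, `Backslash`, `Text`) and the `tokenize` function are hypothetical names chosen for illustration; they are not taken from this proposal or from any existing parser API.

```typescript
// Hypothetical token kinds, for illustration only.
type TokenKind = 'Dollar' | 'Comma' | 'Equals' | 'Slash' | 'Pipe' | 'Backslash' | 'Text';

interface Token {
    kind: TokenKind;
    start: number; // inclusive offset in the input
    end: number; // exclusive offset in the input
}

// Single characters that are emitted as their own token (assumed set).
const PUNCTUATION = new Map<string, TokenKind>([
    ['$', 'Dollar'],
    [',', 'Comma'],
    ['=', 'Equals'],
    ['/', 'Slash'],
    ['|', 'Pipe'],
    ['\\', 'Backslash'],
]);

function tokenize(input: string): Token[] {
    const tokens: Token[] = [];
    let i = 0;
    while (i < input.length) {
        const kind = PUNCTUATION.get(input.charAt(i));
        if (kind !== undefined) {
            tokens.push({ kind, start: i, end: i + 1 });
            i += 1;
            continue;
        }
        // Everything between punctuation characters becomes one Text token.
        const start = i;
        while (i < input.length && !PUNCTUATION.has(input.charAt(i))) {
            i += 1;
        }
        tokens.push({ kind: 'Text', start, end: i });
    }
    return tokens;
}

const tokens = tokenize('||example.com^$domain=a.com');
// Kinds in order: Pipe, Pipe, Text, Dollar, Text, Equals, Text
console.log(tokens.map((t) => t.kind).join(', '));
```

Tokens carry only offsets, so later stages can decide what a `Slash` or `Dollar` means in context without copying substrings.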
Separating Modifiers from Each Other
To correctly separate modifiers, we must be aware of which modifiers can have regex values (or must always be regex), such as
Handling Special Cases in Modifier Separation
Separating modifiers can sometimes be tricky. For example, in cosmetic rules, the
This leads to cases like:
[$path=/foo,domain=/example.(com|org)/]##.ad
In this case, using the "find the next slash token" strategy would be incorrect because we would mistakenly parse
The "find the next slash token" strategy is also not enough for
[$domain=/example\d{1,}\.(com|org)/|notregex.com]##.ad
Other
Once we add support for uBO's hostname regexp, we should handle the following case:
/example\d{1,}\.(com|org)/##.ad
! ↑
! not a domain separator |
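To make the failure of the "find the next slash token" strategy concrete, here is a deliberately naive sketch applied to the `$path`/`$domain` examples above. Both functions (`splitModifiersNaively`, `splitDomainsNaively`) are hypothetical and exist only to show the wrong results; they do not reflect how any existing parser works.

```typescript
// Naive strategy: a value that starts with "/" right after "=" is assumed to be
// a regex that ends at the next "/" found in the input.
function splitModifiersNaively(list: string): string[] {
    const parts: string[] = [];
    let start = 0;
    let i = 0;
    while (i < list.length) {
        const ch = list.charAt(i);
        if (ch === '/' && list.charAt(i - 1) === '=') {
            // Jump past the next slash, whatever it actually belongs to.
            const close = list.indexOf('/', i + 1);
            i = close === -1 ? list.length : close + 1;
            continue;
        }
        if (ch === ',') {
            parts.push(list.slice(start, i));
            start = i + 1;
        }
        i += 1;
    }
    parts.push(list.slice(start));
    return parts;
}

// Naive domain list split: every "|" is treated as a domain separator.
function splitDomainsNaively(value: string): string[] {
    return value.split('|');
}

// The comma between "path" and "domain" is swallowed, so only one part comes back.
console.log(splitModifiersNaively('path=/foo,domain=/example.(com|org)/'));

// The "|" inside "(com|org)" is cut as well, giving three parts instead of two.
console.log(splitDomainsNaively('/example\\d{1,}\\.(com|org)/|notregex.com'));
```

In the first call, the slash that opens the `domain` regex is taken as the end of a `path` regex. In the second, the alternation inside the regex is split as if it separated domains.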
A frequently recurring issue that often leads to misunderstandings is that, in the current adblock syntax, certain characters need to be escaped in regular expressions to ensure the correct parsing of Adblock rules. Moreover, these escaping rules are inconsistent, because we need to escape different characters according to the actual context.
For example, in cosmetic rule modifiers, the `]` character must be escaped:
However, not escaping the `$` character in cosmetic rule modifiers is not an issue since it practically follows the `[` at the beginning of the rule.
On the other hand, in network rule modifiers, the `]` character does not need to be escaped because it has no special meaning there. However, the `$` character must be escaped, as it serves as the separator between the network rule pattern and its modifiers. For example:
For this reason, I propose that we develop a tokenization logic that allows the use of regular expressions without requiring escaping any character in any context, wherever possible.
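The `$` escaping requirement in network rules can be illustrated with a simplified sketch. It assumes that the separator is simply the last unescaped `$` in the rule; `isEscaped` and `findModifierSeparator` are hypothetical helpers, and real parsers may apply additional checks.

```typescript
// True when the character at `index` is preceded by an odd number of backslashes.
function isEscaped(text: string, index: number): boolean {
    let backslashes = 0;
    for (let i = index - 1; i >= 0 && text.charAt(i) === '\\'; i -= 1) {
        backslashes += 1;
    }
    return backslashes % 2 === 1;
}

// Simplified assumption: the pattern/modifier separator is the last unescaped "$".
function findModifierSeparator(rule: string): number {
    for (let i = rule.length - 1; i >= 0; i -= 1) {
        if (rule.charAt(i) === '$' && !isEscaped(rule, i)) {
            return i;
        }
    }
    return -1;
}

// Escaped "$" inside the regex value: the real separator at offset 14 is found.
console.log(findModifierSeparator('||example.com^$removeparam=/id\\$/')); // 14

// Unescaped "$" inside the regex value: offset 30 is picked, which splits the
// rule in the middle of the regex.
console.log(findModifierSeparator('||example.com^$removeparam=/id$/')); // 30
```

Under this assumption, the rule author has to compensate for the parser by escaping `$`, which is exactly the burden the proposal aims to remove.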
To achieve this, we need to collect as many edge cases as we can and design an appropriate algorithm. It is crucial to ensure that `NetworkRuleParser` and `ModifierListParser` do not redundantly perform the same computations. In simple cases (e.g., where no `$` character is present in network rules), the number of checks should be minimized.
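One possible shape for the "minimal checks" requirement, as a sketch under stated assumptions: a single native scan handles rules without `$`, and the computed offsets are returned so that a later modifier-list stage can reuse them instead of rescanning. `RuleParts` and `locateParts` are hypothetical names, and the escape handling is deliberately simplified to "a `$` directly preceded by a backslash is skipped".

```typescript
// Hypothetical result shape: offsets are computed once and reused downstream.
interface RuleParts {
    patternEnd: number; // exclusive end offset of the pattern
    modifiersStart: number; // start offset of the modifier list, or -1 if absent
}

function locateParts(rule: string): RuleParts {
    // Fast path: one native scan; rules without "$" stop here.
    let index = rule.lastIndexOf('$');
    if (index === -1) {
        return { patternEnd: rule.length, modifiersStart: -1 };
    }
    // Slow path (simplified): skip every "$" directly preceded by a backslash.
    while (index > 0 && rule.charAt(index - 1) === '\\') {
        index = rule.lastIndexOf('$', index - 1);
    }
    if (index === -1) {
        return { patternEnd: rule.length, modifiersStart: -1 };
    }
    return { patternEnd: index, modifiersStart: index + 1 };
}

const rule = '||example.com^$third-party';
const parts = locateParts(rule);
// A modifier-list stage can start directly at the known offset.
const modifierList = parts.modifiersStart === -1 ? null : rule.slice(parts.modifiersStart);
console.log(rule.slice(0, parts.patternEnd), modifierList);
```

Passing offsets rather than re-deriving them is one way to keep the two parsers from repeating the same separator search.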