Improve RegExp tokenization in Adblock syntax to ensure consistency and resolve escaping issues #121

Open
scripthunter7 opened this issue Mar 27, 2024 · 1 comment

scripthunter7 commented Mar 27, 2024

A frequently recurring issue that often leads to misunderstandings is that, in the current Adblock syntax, certain characters need to be escaped in regular expressions to ensure that rules are parsed correctly. Moreover, these escaping requirements are inconsistent: different characters must be escaped depending on the context.

For example, in cosmetic rule modifiers, the ] and , characters must be escaped:

[$domain=/example[0-9]\.(com|org)/]##.ad
!                    ↑
!                    this closing square bracket is falsely considered as the end of the modifier block
[$domain=/example\d{1,}\.(com|org)/]##.ad
!                    ↑
!                    this comma is falsely considered as a modifier separator

However, leaving the $ character unescaped in cosmetic rule modifiers is not an issue, since it only has special meaning when it directly follows the [ at the beginning of the rule.

On the other hand, in network rule modifiers, the ] character does not need to be escaped because it has no special meaning there. However, the $ character must be escaped, as it serves as the separator between the network rule pattern and its modifiers. For example:

||www.amazon.$removeparam=/^[a-z_]{1,20}=[a-zA-Z0-9._-]{80,}$/
!            ↑                                              ↑
!      Rule separator                                Regexp end marker

For this reason, I propose that we develop tokenization logic that allows regular expressions to be used without requiring any characters to be escaped, in any context, wherever possible.

To achieve this, we need to collect as many edge cases as we can and design an appropriate algorithm. It is crucial to ensure that NetworkRuleParser and ModifierListParser do not redundantly perform the same computations. In simple cases (e.g., where no $ character is present in network rules), the number of checks should be minimized.

scripthunter7 changed the title from "Allow using unescaped $ symbol in the value of some special modifiers" to "Eliminate the need for extra escapes in regular expressions" on Sep 16, 2024
scripthunter7 changed the title from "Eliminate the need for extra escapes in regular expressions" to "Improve RegExp tokenization in Adblock syntax to ensure consistency and resolve escaping issues" on Mar 12, 2025

scripthunter7 commented Mar 12, 2025

Our parsing logic fundamentally needs to handle three key aspects:

  1. The left side of cosmetic rules, which is the part to the left of the separator marker (e.g., ##). We need to determine the start and end of the modifier list and ensure the modifiers are properly separated.
  2. The entire network rule, where we need to identify the pattern, the separator character, and the modifiers.
  3. The right side of the network rule, which is the part to the right of the $ separator. Here, modifiers must be correctly separated from each other.

Tokenization Strategy

I suggest breaking any input string into the following tokens:

  • <whitespace-token>: A sequence of any unescaped whitespace characters.
  • <slash-token>: An unescaped / character.
  • <dollar-token>: An unescaped $ character.
  • <equals-token>: An unescaped = character.
  • <comma-token>: An unescaped , character.
  • <word-token>: A sequence of any unescaped ASCII letters, digits, or - characters. These typically appear in modifier names, such as 3p, third-party, removeparam, etc.
  • <other-token>: Any other sequence of characters that do not fit into the categories above.
  • <eof-token>: A token indicating the end of the input.

For cosmetic rules, we need to extend tokenization with the following additional tokens (a rough sketch of the full token set follows after this list):

  • <left-bracket-token>: An unescaped [ character.
  • <right-bracket-token>: An unescaped ] character.
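
To make the proposal more concrete, here is a minimal sketch of such a tokenizer in TypeScript. The names (TokenKind, Token, tokenize) and the exact treatment of escape sequences are only assumptions for illustration, not a final AGTree design; in particular, a real implementation would probably merge consecutive <other-token> runs.

// Illustrative token kinds for the proposed tokenizer (names are not final).
enum TokenKind {
    Whitespace,   // <whitespace-token>
    Slash,        // <slash-token>
    Dollar,       // <dollar-token>
    Equals,       // <equals-token>
    Comma,        // <comma-token>
    Word,         // <word-token>
    Other,        // <other-token>
    Eof,          // <eof-token>
    LeftBracket,  // <left-bracket-token>  (cosmetic rules only)
    RightBracket, // <right-bracket-token> (cosmetic rules only)
}

interface Token {
    kind: TokenKind;
    start: number; // inclusive offset in the input
    end: number;   // exclusive offset in the input
}

const isWordChar = (ch: string): boolean => /[A-Za-z0-9-]/.test(ch);
const isWhitespace = (ch: string): boolean => /\s/.test(ch);

function tokenize(input: string, cosmetic = false): Token[] {
    const tokens: Token[] = [];
    let i = 0;
    const push = (kind: TokenKind, start: number): void => {
        tokens.push({ kind, start, end: i });
    };
    while (i < input.length) {
        const start = i;
        const ch = input[i];
        if (ch === '\\' && i + 1 < input.length) {
            // An escaped character never produces a special token; it becomes
            // part of an <other-token> (a real implementation would merge runs).
            i += 2;
            push(TokenKind.Other, start);
        } else if (isWhitespace(ch)) {
            while (i < input.length && isWhitespace(input[i])) i += 1;
            push(TokenKind.Whitespace, start);
        } else if (isWordChar(ch)) {
            while (i < input.length && isWordChar(input[i])) i += 1;
            push(TokenKind.Word, start);
        } else if (ch === '/') { i += 1; push(TokenKind.Slash, start); }
        else if (ch === '$') { i += 1; push(TokenKind.Dollar, start); }
        else if (ch === '=') { i += 1; push(TokenKind.Equals, start); }
        else if (ch === ',') { i += 1; push(TokenKind.Comma, start); }
        else if (cosmetic && ch === '[') { i += 1; push(TokenKind.LeftBracket, start); }
        else if (cosmetic && ch === ']') { i += 1; push(TokenKind.RightBracket, start); }
        else {
            i += 1;
            push(TokenKind.Other, start);
        }
    }
    push(TokenKind.Eof, i);
    return tokens;
}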

Example: Token Processing for Network Rules

  1. Trivial case: If there is no $-token in the string, then the entire string is treated as a pattern.

    • Example:
      ||example.com^
  2. If the string contains a single $-token, followed by a <word-token><whitespace-token>* (<equals-token>|<eof-token>), then the $-token is definitely the separator character.

    • Example cases:
      ||example.com^$3p
      ||example.com^$domain=example.org
    • Otherwise, the entire string is treated as a pattern, and the $-token is part of the pattern.
    • Example cases:
      ||example.com/$rpc/
      /^https?:\/\/example\.com\/product\/\$[0-9]{5}$/
    • However, we must consider that this method incorrectly treats ||example.com/$rpc as if rpc were a modifier.
  3. If the input string contains multiple $-tokens, a more detailed check is required (but since this is a rarer case, it does not significantly slow down the parser). A rough sketch of this separator search follows after this list.

    1. Identify the actual separator character:
      • This is most likely the first $-token followed by a <word-token><whitespace-token>* (<equals-token>|<eof-token>) token sequence.
      • If multiple candidates exist, the correct separator is the last one that is not preceded by a <word-token><whitespace-token>*<equals-token> token sequence where <word-token> represents a known modifier whose value can be a regex.
    2. Properly separate the modifiers.
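
A rough transcription of the steps above, reusing the Token / TokenKind shapes from the tokenizer sketch. The helper names, the set of regex-capable modifiers, and the reading of step 3's "preceded by" check as "the nearest name= assignment before the $" are all assumptions:

// Modifiers whose value may be a regular expression (illustrative, not exhaustive).
const REGEX_CAPABLE_MODIFIERS = new Set(['domain', 'removeparam']);

/** Returns the index of the first non-<whitespace-token> at or after `i`. */
function skipWhitespace(tokens: Token[], i: number): number {
    while (tokens[i]?.kind === TokenKind.Whitespace) i += 1;
    return i;
}

/** Step 2's pattern: <dollar-token> <word-token> <whitespace-token>* (<equals-token> | <eof-token>). */
function looksLikeModifierList(tokens: Token[], dollarIndex: number): boolean {
    let i = dollarIndex + 1;
    if (tokens[i]?.kind !== TokenKind.Word) return false;
    i = skipWhitespace(tokens, i + 1);
    return tokens[i]?.kind === TokenKind.Equals || tokens[i]?.kind === TokenKind.Eof;
}

/** Step 3.1's heuristic: is this '$' likely inside the regex value of a known
 *  regex-capable modifier, i.e. preceded by <word-token> <whitespace-token>* <equals-token>? */
function insideRegexCapableValue(source: string, tokens: Token[], dollarIndex: number): boolean {
    for (let i = dollarIndex - 1; i >= 0; i -= 1) {
        if (tokens[i].kind !== TokenKind.Equals) continue;
        let j = i - 1;
        while (j >= 0 && tokens[j].kind === TokenKind.Whitespace) j -= 1;
        if (j >= 0 && tokens[j].kind === TokenKind.Word) {
            const name = source.slice(tokens[j].start, tokens[j].end);
            return REGEX_CAPABLE_MODIFIERS.has(name);
        }
        return false;
    }
    return false;
}

/** Finds the <dollar-token> separating pattern and modifiers, or null if the
 *  whole input is a pattern (steps 1-3 above). */
function findSeparatorDollar(source: string, tokens: Token[]): number | null {
    const candidates: number[] = [];
    for (let i = 0; i < tokens.length; i += 1) {
        if (tokens[i].kind === TokenKind.Dollar && looksLikeModifierList(tokens, i)) {
            candidates.push(i);
        }
    }
    if (candidates.length === 0) return null;          // step 1, or patterns like ||example.com/$rpc/
    if (candidates.length === 1) return candidates[0]; // step 2
    // Step 3: take the last candidate that does not look like part of a regex value.
    for (let c = candidates.length - 1; c >= 0; c -= 1) {
        if (!insideRegexCapableValue(source, tokens, candidates[c])) return candidates[c];
    }
    return candidates[candidates.length - 1];
}

With the example rules above, ||example.com^$domain=example.org yields a single candidate (the separator), while /^https?:\/\/example\.com\/product\/\$[0-9]{5}$/ yields none, so the whole string stays a pattern; ||example.com/$rpc still hits the known false positive mentioned in step 2.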

Separating Modifiers from Each Other

To correctly separate modifiers, we must be aware of which modifiers can have regex values (or must always be regex), such as domain.

  1. When handling regex-capable modifiers:

    • If a <word-token> matches a known modifier name and is followed by <whitespace-token>*<equals-token><whitespace-token>*<slash-token>,
      • We must find the next <slash-token>, which should mark the end of the modifier value.
      • This second <slash-token> must then be followed by either <whitespace-token>*<comma-token> or <whitespace-token>*<eof-token>, otherwise, a parse error should be raised.
  2. For non-regex modifiers:

    • We only need to locate the last token that is followed by <whitespace-token>*<comma-token> or <whitespace-token>*<eof-token>, which marks the end of the modifier (both cases are sketched below).
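
Continuing the same sketch (and reusing Token, TokenKind, skipWhitespace and REGEX_CAPABLE_MODIFIERS from above), both cases could be combined roughly as follows; the slice-based output and the error handling are illustrative only:

// Splits the token stream of a modifier list (the part to the right of the
// separator '$') into per-modifier offset slices in the source text.
interface ModifierSlice {
    start: number; // inclusive offset of the modifier text in the source
    end: number;   // exclusive offset
}

function splitModifiers(source: string, tokens: Token[]): ModifierSlice[] {
    const slices: ModifierSlice[] = [];
    let i = skipWhitespace(tokens, 0);
    while (tokens[i] && tokens[i].kind !== TokenKind.Eof) {
        const start = tokens[i].start;
        let end = start;
        if (tokens[i].kind === TokenKind.Word) {
            const name = source.slice(tokens[i].start, tokens[i].end);
            const eq = skipWhitespace(tokens, i + 1);
            const valueStart = skipWhitespace(tokens, eq + 1);
            if (
                REGEX_CAPABLE_MODIFIERS.has(name)
                && tokens[eq]?.kind === TokenKind.Equals
                && tokens[valueStart]?.kind === TokenKind.Slash
            ) {
                // Case 1: regex-capable modifier with a '/'-delimited value.
                // The next <slash-token> closes the value ...
                let j = valueStart + 1;
                while (tokens[j] && tokens[j].kind !== TokenKind.Slash && tokens[j].kind !== TokenKind.Eof) j += 1;
                if (!tokens[j] || tokens[j].kind !== TokenKind.Slash) {
                    throw new Error('Unterminated regex in modifier value');
                }
                // ... and must be followed by <whitespace-token>* (<comma-token> | <eof-token>).
                const after = skipWhitespace(tokens, j + 1);
                if (tokens[after]?.kind !== TokenKind.Comma && tokens[after]?.kind !== TokenKind.Eof) {
                    throw new Error('Unexpected token after regex modifier value');
                }
                end = tokens[j].end;
                i = after;
            }
        }
        if (end === start) {
            // Case 2: non-regex modifier: consume everything up to the next
            // <comma-token> / <eof-token> (trailing whitespace trimming is left out here).
            while (tokens[i] && tokens[i].kind !== TokenKind.Comma && tokens[i].kind !== TokenKind.Eof) {
                end = tokens[i].end;
                i += 1;
            }
        }
        slices.push({ start, end });
        if (tokens[i]?.kind === TokenKind.Comma) i = skipWhitespace(tokens, i + 1);
    }
    return slices;
}

This is also the point where the special cases discussed in the next section, such as $path, would need to hook in.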

Handling Special Cases in Modifier Separation

Separating modifiers can sometimes be tricky. For example, in cosmetic rules, the $path modifier is special because its value can be either:

  • A regular expression, or
  • A simple path.

This leads to cases like:

[$path=/foo,domain=/example.(com|org)/]##.ad

In this case, using the "find the next slash token" strategy would be incorrect because we would mistakenly parse $path=/foo,domain=/ as the value of the path modifier. Therefore, special handling is needed for $path, ensuring that its value is correctly interpreted as /foo rather than a regex.

The "find the next slash token" strategy also not enough for domain, because its value is a pipe separated list:

[$domain=/example\d{1,}\.(com|org)/|notregex.com]##.ad
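
As an illustration only (not a finished design), splitting such a pipe-separated value could track whether the scanner is currently inside an unescaped /.../ entry; for simplicity this sketch works directly on the raw value text rather than on tokens:

// Splits a domain-list value on unescaped '|' while keeping '/.../' regex
// entries intact (assumes the value itself has already been extracted).
function splitDomainList(value: string): string[] {
    const parts: string[] = [];
    let current = '';
    let inRegex = false;
    for (let i = 0; i < value.length; i += 1) {
        const ch = value[i];
        if (ch === '\\' && i + 1 < value.length) {
            current += ch + value[i + 1]; // keep escaped characters as-is
            i += 1;
        } else if (ch === '/') {
            inRegex = !inRegex; // toggle at the regex delimiters
            current += ch;
        } else if (ch === '|' && !inRegex) {
            parts.push(current);
            current = '';
        } else {
            current += ch;
        }
    }
    parts.push(current);
    return parts;
}

// Example (from the rule above):
//   splitDomainList('/example\\d{1,}\\.(com|org)/|notregex.com')
//     -> ['/example\\d{1,}\\.(com|org)/', 'notregex.com']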

Other

Once we add support for uBO's hostname regexp, we should handle the following case:

/example\d{1,}\.(com|org)/##.ad
!           ↑
!   not a domain separator
