Improve RegExp tokenization in Adblock syntax to ensure consistency and resolve escaping issues #121

Open
scripthunter7 opened this issue Mar 27, 2024 · 1 comment

scripthunter7 commented Mar 27, 2024

A frequently recurring issue that often leads to misunderstandings is that, in the current Adblock syntax, certain characters need to be escaped in regular expressions to ensure that rules are parsed correctly. Moreover, these escaping requirements are inconsistent: different characters must be escaped depending on the context.

For example, in cosmetic rule modifiers, the ] and , characters must be escaped:

[$domain=/example[0-9]\.(com|org)/]##.ad
!                    ↑
!                    this closing square bracket is falsely considered as the end of the modifier block
[$domain=/example\d{1,}\.(com|org)/]##.ad
!                    ↑
!                    this comma is falsely considered as a modifier separator

However, leaving the $ character unescaped in cosmetic rule modifiers is not an issue, since it only has special meaning when it directly follows the [ at the beginning of the rule.

On the other hand, in network rule modifiers, the ] character does not need to be escaped because it has no special meaning there. However, the $ character must be escaped, as it serves as the separator between the network rule pattern and its modifiers. For example:

||www.amazon.$removeparam=/^[a-z_]{1,20}=[a-zA-Z0-9._-]{80,}$/
!            ↑                                              ↑
!      Rule separator                                Regexp end marker

For this reason, I propose that we develop tokenization logic that allows regular expressions to be used without requiring any characters to be escaped, in any context, wherever possible.

To achieve this, we need to collect as many edge cases as we can and design an appropriate algorithm. It is crucial to ensure that NetworkRuleParser and ModifierListParser do not redundantly perform the same computations. In simple cases (e.g., where no $ character is present in network rules), the number of checks should be minimized.

scripthunter7 changed the title from "Allow using unescaped $ symbol in the value of some special modifiers" to "Eliminate the need for extra escapes in regular expressions" on Sep 16, 2024
scripthunter7 changed the title from "Eliminate the need for extra escapes in regular expressions" to "Improve RegExp tokenization in Adblock syntax to ensure consistency and resolve escaping issues" on Mar 12, 2025

scripthunter7 commented Mar 12, 2025

Our parsing logic fundamentally needs to handle three key aspects:

  1. The left side of cosmetic rules, which is the part to the left of the separator marker (e.g., ##). We need to determine the start and end of the modifier list and ensure the modifiers are properly separated.
  2. The entire network rule, where we need to identify the pattern, the separator character, and the modifiers.
  3. The right side of the network rule, which is the part to the right of the $ separator. Here, modifiers must be correctly separated from each other.

Tokenization Strategy

I suggest breaking any input string into the following tokens:

  • <whitespace-token>: A sequence of any unescaped whitespace characters.
  • <slash-token>: An unescaped / character.
  • <dollar-token>: An unescaped $ character.
  • <equals-token>: An unescaped = character.
  • <comma-token>: An unescaped , character.
  • <word-token>: A sequence of any unescaped ASCII letters, digits, or - characters. These typically appear in modifier names, such as 3p, third-party, removeparam, etc.
  • <other-token>: Any other sequence of characters that do not fit into the categories above.
  • <eof-token>: A token indicating the end of the input.

For cosmetic rules, we need to extend tokenization with the following additional tokens (a rough sketch of the full token set follows after this list):

  • <left-bracket-token>: An unescaped [ character.
  • <right-bracket-token>: An unescaped ] character.
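
To make the proposal more concrete, here is a minimal sketch of such a tokenizer in TypeScript. The names (TokenKind, Token, tokenize) and the exact treatment of escape sequences are only assumptions for illustration, not a final AGTree design; in particular, a real implementation would probably merge consecutive <other-token> runs.

// Illustrative token kinds for the proposed tokenizer (names are not final).
enum TokenKind {
    Whitespace,   // <whitespace-token>
    Slash,        // <slash-token>
    Dollar,       // <dollar-token>
    Equals,       // <equals-token>
    Comma,        // <comma-token>
    Word,         // <word-token>
    Other,        // <other-token>
    Eof,          // <eof-token>
    LeftBracket,  // <left-bracket-token>  (cosmetic rules only)
    RightBracket, // <right-bracket-token> (cosmetic rules only)
}

interface Token {
    kind: TokenKind;
    start: number; // inclusive offset in the input
    end: number;   // exclusive offset in the input
}

const isWordChar = (ch: string): boolean => /[A-Za-z0-9-]/.test(ch);
const isWhitespace = (ch: string): boolean => /\s/.test(ch);

function tokenize(input: string, cosmetic = false): Token[] {
    const tokens: Token[] = [];
    let i = 0;
    const push = (kind: TokenKind, start: number): void => {
        tokens.push({ kind, start, end: i });
    };
    while (i < input.length) {
        const start = i;
        const ch = input[i];
        if (ch === '\\' && i + 1 < input.length) {
            // An escaped character never produces a special token; it becomes
            // part of an <other-token> (a real implementation would merge runs).
            i += 2;
            push(TokenKind.Other, start);
        } else if (isWhitespace(ch)) {
            while (i < input.length && isWhitespace(input[i])) i += 1;
            push(TokenKind.Whitespace, start);
        } else if (isWordChar(ch)) {
            while (i < input.length && isWordChar(input[i])) i += 1;
            push(TokenKind.Word, start);
        } else if (ch === '/') { i += 1; push(TokenKind.Slash, start); }
        else if (ch === '$') { i += 1; push(TokenKind.Dollar, start); }
        else if (ch === '=') { i += 1; push(TokenKind.Equals, start); }
        else if (ch === ',') { i += 1; push(TokenKind.Comma, start); }
        else if (cosmetic && ch === '[') { i += 1; push(TokenKind.LeftBracket, start); }
        else if (cosmetic && ch === ']') { i += 1; push(TokenKind.RightBracket, start); }
        else {
            i += 1;
            push(TokenKind.Other, start);
        }
    }
    push(TokenKind.Eof, i);
    return tokens;
}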

Example: Token Processing for Network Rules

  1. Trivial case: If there is no $-token in the string, then the entire string is treated as a pattern.

    • Example:
      ||example.com^
  2. If the string contains a single $-token, followed by a <word-token><whitespace-token>* (<equals-token>|<eof-token>), then the $-token is definitely the separator character.

    • Example cases:
      ||example.com^$3p
      ||example.com^$domain=example.org
    • Otherwise, the entire string is treated as a pattern, and the $-token is part of the pattern.
    • Example cases:
      ||example.com/$rpc/
      /^https?:\/\/example\.com\/product\/\$[0-9]{5}$/
    • However, we must consider that this method incorrectly treats ||example.com/$rpc as if rpc were a modifier.
  3. If the input string contains multiple $-tokens, a more detailed check is required (but since this is a rarer case, it does not significantly slow down the parser). A rough sketch of this separator search follows after this list.

    1. Identify the actual separator character:
      • This is most likely the first $-token followed by a <word-token><whitespace-token>* (<equals-token>|<eof-token>) token sequence.
      • If multiple candidates exist, the correct separator is the last one that is not preceded by a <word-token><whitespace-token>*<equals-token> token sequence where <word-token> represents a known modifier whose value can be a regex.
    2. Properly separate the modifiers.
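
A rough transcription of the steps above, reusing the Token / TokenKind shapes from the tokenizer sketch. The helper names, the set of regex-capable modifiers, and the reading of step 3's "preceded by" check as "the nearest name= assignment before the $" are all assumptions:

// Modifiers whose value may be a regular expression (illustrative, not exhaustive).
const REGEX_CAPABLE_MODIFIERS = new Set(['domain', 'removeparam']);

/** Returns the index of the first non-<whitespace-token> at or after `i`. */
function skipWhitespace(tokens: Token[], i: number): number {
    while (tokens[i]?.kind === TokenKind.Whitespace) i += 1;
    return i;
}

/** Step 2's pattern: <dollar-token> <word-token> <whitespace-token>* (<equals-token> | <eof-token>). */
function looksLikeModifierList(tokens: Token[], dollarIndex: number): boolean {
    let i = dollarIndex + 1;
    if (tokens[i]?.kind !== TokenKind.Word) return false;
    i = skipWhitespace(tokens, i + 1);
    return tokens[i]?.kind === TokenKind.Equals || tokens[i]?.kind === TokenKind.Eof;
}

/** Step 3.1's heuristic: is this '$' likely inside the regex value of a known
 *  regex-capable modifier, i.e. preceded by <word-token> <whitespace-token>* <equals-token>? */
function insideRegexCapableValue(source: string, tokens: Token[], dollarIndex: number): boolean {
    for (let i = dollarIndex - 1; i >= 0; i -= 1) {
        if (tokens[i].kind !== TokenKind.Equals) continue;
        let j = i - 1;
        while (j >= 0 && tokens[j].kind === TokenKind.Whitespace) j -= 1;
        if (j >= 0 && tokens[j].kind === TokenKind.Word) {
            const name = source.slice(tokens[j].start, tokens[j].end);
            return REGEX_CAPABLE_MODIFIERS.has(name);
        }
        return false;
    }
    return false;
}

/** Finds the <dollar-token> separating pattern and modifiers, or null if the
 *  whole input is a pattern (steps 1-3 above). */
function findSeparatorDollar(source: string, tokens: Token[]): number | null {
    const candidates: number[] = [];
    for (let i = 0; i < tokens.length; i += 1) {
        if (tokens[i].kind === TokenKind.Dollar && looksLikeModifierList(tokens, i)) {
            candidates.push(i);
        }
    }
    if (candidates.length === 0) return null;          // step 1, or patterns like ||example.com/$rpc/
    if (candidates.length === 1) return candidates[0]; // step 2
    // Step 3: take the last candidate that does not look like part of a regex value.
    for (let c = candidates.length - 1; c >= 0; c -= 1) {
        if (!insideRegexCapableValue(source, tokens, candidates[c])) return candidates[c];
    }
    return candidates[candidates.length - 1];
}

With the example rules above, ||example.com^$domain=example.org yields a single candidate (the separator), while /^https?:\/\/example\.com\/product\/\$[0-9]{5}$/ yields none, so the whole string stays a pattern; ||example.com/$rpc still hits the known false positive mentioned in step 2.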

Separating Modifiers from Each Other

To correctly separate modifiers, we must be aware of which modifiers can have regex values (or must always be regex), such as domain.

  1. When handling regex-capable modifiers:

    • If a <word-token> matches a known modifier name and is followed by <whitespace-token>*<equals-token><whitespace-token>*<slash-token>,
      • We must find the next <slash-token>, which should mark the end of the modifier value.
      • This second <slash-token> must then be followed by either <whitespace-token>*<comma-token> or <whitespace-token>*<eof-token>, otherwise, a parse error should be raised.
  2. For non-regex modifiers:

    • We only need to locate the last token that is followed by <whitespace-token>*<comma-token> or <whitespace-token>*<eof-token>, which marks the end of the modifier (both cases are sketched below).
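
Continuing the same sketch (and reusing Token, TokenKind, skipWhitespace and REGEX_CAPABLE_MODIFIERS from above), both cases could be combined roughly as follows; the slice-based output and the error handling are illustrative only:

// Splits the token stream of a modifier list (the part to the right of the
// separator '$') into per-modifier offset slices in the source text.
interface ModifierSlice {
    start: number; // inclusive offset of the modifier text in the source
    end: number;   // exclusive offset
}

function splitModifiers(source: string, tokens: Token[]): ModifierSlice[] {
    const slices: ModifierSlice[] = [];
    let i = skipWhitespace(tokens, 0);
    while (tokens[i] && tokens[i].kind !== TokenKind.Eof) {
        const start = tokens[i].start;
        let end = start;
        if (tokens[i].kind === TokenKind.Word) {
            const name = source.slice(tokens[i].start, tokens[i].end);
            const eq = skipWhitespace(tokens, i + 1);
            const valueStart = skipWhitespace(tokens, eq + 1);
            if (
                REGEX_CAPABLE_MODIFIERS.has(name)
                && tokens[eq]?.kind === TokenKind.Equals
                && tokens[valueStart]?.kind === TokenKind.Slash
            ) {
                // Case 1: regex-capable modifier with a '/'-delimited value.
                // The next <slash-token> closes the value ...
                let j = valueStart + 1;
                while (tokens[j] && tokens[j].kind !== TokenKind.Slash && tokens[j].kind !== TokenKind.Eof) j += 1;
                if (!tokens[j] || tokens[j].kind !== TokenKind.Slash) {
                    throw new Error('Unterminated regex in modifier value');
                }
                // ... and must be followed by <whitespace-token>* (<comma-token> | <eof-token>).
                const after = skipWhitespace(tokens, j + 1);
                if (tokens[after]?.kind !== TokenKind.Comma && tokens[after]?.kind !== TokenKind.Eof) {
                    throw new Error('Unexpected token after regex modifier value');
                }
                end = tokens[j].end;
                i = after;
            }
        }
        if (end === start) {
            // Case 2: non-regex modifier: consume everything up to the next
            // <comma-token> / <eof-token> (trailing whitespace trimming is left out here).
            while (tokens[i] && tokens[i].kind !== TokenKind.Comma && tokens[i].kind !== TokenKind.Eof) {
                end = tokens[i].end;
                i += 1;
            }
        }
        slices.push({ start, end });
        if (tokens[i]?.kind === TokenKind.Comma) i = skipWhitespace(tokens, i + 1);
    }
    return slices;
}

This is also the point where the special cases discussed in the next section, such as $path, would need to hook in.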

Handling Special Cases in Modifier Separation

Separating modifiers can sometimes be tricky. For example, in cosmetic rules, the $path modifier is special because its value can be either:

  • A regular expression, or
  • A simple path.

This leads to cases like:

[$path=/foo,domain=/example.(com|org)/]##.ad

In this case, using the "find the next slash token" strategy would be incorrect because we would mistakenly parse $path=/foo,domain=/ as the value of the path modifier. Therefore, special handling is needed for $path, ensuring that its value is correctly interpreted as /foo rather than a regex.

The "find the next slash token" strategy also not enough for domain, because its value is a pipe separated list:

[$domain=/example\d{1,}\.(com|org)/|notregex.com]##.ad
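
As an illustration only (not a finished design), splitting such a pipe-separated value could track whether the scanner is currently inside an unescaped /.../ entry; for simplicity this sketch works directly on the raw value text rather than on tokens:

// Splits a domain-list value on unescaped '|' while keeping '/.../' regex
// entries intact (assumes the value itself has already been extracted).
function splitDomainList(value: string): string[] {
    const parts: string[] = [];
    let current = '';
    let inRegex = false;
    for (let i = 0; i < value.length; i += 1) {
        const ch = value[i];
        if (ch === '\\' && i + 1 < value.length) {
            current += ch + value[i + 1]; // keep escaped characters as-is
            i += 1;
        } else if (ch === '/') {
            inRegex = !inRegex; // toggle at the regex delimiters
            current += ch;
        } else if (ch === '|' && !inRegex) {
            parts.push(current);
            current = '';
        } else {
            current += ch;
        }
    }
    parts.push(current);
    return parts;
}

// Example (from the rule above):
//   splitDomainList('/example\\d{1,}\\.(com|org)/|notregex.com')
//     -> ['/example\\d{1,}\\.(com|org)/', 'notregex.com']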

Other

Once we add support for uBO's hostname regexp, we should handle the following case:

/example\d{1,}\.(com|org)/##.ad
!           ↑
!   not a domain separator
