RegExp Processing of CharacterClassEscape for \W should be the same as !\w with unicode and ignoreCase flags. #516

Closed
wants to merge 2 commits into from

Conversation

msaboff
Contributor

@msaboff msaboff commented Apr 4, 2016

The current specification of CharacterClassEscapes in regular expressions gives \W surprising behavior when both the unicode and ignoreCase flags are provided.

There are 6 CharacterClassEscapes defined:
\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W, with the exception of /\W/ui and the characters 'k', 'K', 's' and 'S'.

These exceptions should be removed, so that /\W/ui.test("k") is false given that /\w/ui.test("k") is true. The same change should be made for the other three affected characters. The proposed changes involve how the CharSet and invert flag are generated for CharacterSetMatcher for CharacterClassEscapes.

The proposed changes are also made for the \D and \S CharacterClassEscapes for consistency, even though the changes do not affect what they match.
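
A toy model may help show where the exception comes from. The sketch below is not spec text: the small `UNIVERSE`, the hand-written `FOLD` table, and all helper names are assumptions made for illustration. It imitates case-insensitive CharacterSetMatcher over a few characters, comparing \W as "complement set, invert false" (current text) with "\w set, invert true" (this proposal).

```javascript
// Toy model of the matching steps discussed above. UNIVERSE, FOLD and the
// helper names are hypothetical and cover only a handful of characters.

// Hand-written case folding for the characters that matter here.
const FOLD = new Map([
  ["A", "a"],
  ["K", "k"],
  ["\u212A", "k"], // KELVIN SIGN folds to "k"
  ["S", "s"],
  ["\u017F", "s"], // LATIN SMALL LETTER LONG S folds to "s"
]);
const canonicalize = (ch) => FOLD.get(ch) ?? ch;

// A tiny stand-in for "the set of all characters".
const UNIVERSE = ["a", "A", "k", "K", "\u212A", "s", "S", "\u017F", "1", "!", " "];
const WORD = new Set(UNIVERSE.filter((c) => /^[A-Za-z0-9_]$/.test(c)));
const NOT_WORD = new Set(UNIVERSE.filter((c) => !WORD.has(c)));

// Case-insensitive set matcher: look for a member whose canonical form
// equals the canonical form of the input, then apply the invert flag.
function characterSetMatcher(charSet, invert, ch) {
  const cc = canonicalize(ch);
  const found = [...charSet].some((a) => canonicalize(a) === cc);
  return invert ? !found : found;
}

const matchesWord = (ch) => characterSetMatcher(WORD, false, ch);
// Current text: \W is the complement set, matched without inversion.
const matchesNotWordCurrent = (ch) => characterSetMatcher(NOT_WORD, false, ch);
// Proposal: \W is the \w set, matched with invert = true.
const matchesNotWordProposed = (ch) => characterSetMatcher(WORD, true, ch);

for (const ch of ["k", "K", "s", "S", "!"]) {
  console.log(
    JSON.stringify(ch),
    matchesWord(ch),
    matchesNotWordCurrent(ch),
    matchesNotWordProposed(ch)
  );
}
```

In this model, "k" matches both \w and the current \W because U+212A sits in the complement set and folds to "k". With the inverted form, \W becomes the exact negation of \w.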

This change has been discussed in issue 512.

@msaboff
Contributor Author

msaboff commented Apr 4, 2016

Fixed the issue that @mathiasbynens found.

@@ -28544,7 +28544,8 @@
</tbody>
</table>
</figure>
<p>The production <emu-grammar>CharacterClassEscape :: `W`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> .</p>
<p>and the Boolean *false*.</p>
Member

this line still seems like it should be reverted?

Member

The markup around this area looks really weird (closing </tbody> without a matching start tag etc.) but this line, at least, is intentional. It’s the continuation of the sentence on line https://github.com/tc39/ecma262/pull/516/files/1944b451b767e30575150204d0d28a8ae270a3a5#diff-3540caefa502006d8a33cb1385720803L28315.

Contributor Author

I'm open to suggestions for a different place to add the "and the Boolean false." than after the table.

@msaboff
Contributor Author

msaboff commented Apr 4, 2016

I have no idea why the change to line 28547 is only showing the one added line. This is what is shown in my local repository for the whole change - exactly what I'm proposing and pushed:

diff --git a/spec.html b/spec.html
index 55fe781..e5131f6 100644
--- a/spec.html
+++ b/spec.html
@@ -28110,8 +28110,8 @@ Date.parse(x.toLocaleString())
         </emu-alg>
         <p>The production <emu-grammar>AtomEscape :: CharacterClassEscape</emu-grammar> evaluates as follows:</p>
         <emu-alg>
-          1. Evaluate |CharacterClassEscape| to obtain a CharSet _A_.
-          1. Call CharacterSetMatcher(_A_, *false*) and return its Matcher result.
+          1. Evaluate |CharacterClassEscape| to obtain a CharSet _A_ and a Boolean _invert_.
+          1. Call CharacterSetMatcher(_A_, _invert_) and return its Matcher result.
         </emu-alg>
         <emu-note>
           <p>An escape sequence of the form `\\` followed by a nonzero decimal number _n_ matches the result of the _n_th set of capturing parentheses (see 0). It is an error if the regular expression has fewer than _n_ capturing parentheses. If the regular expression has _n_ or more capturing parentheses but the _n_th one is *undefined* because it has not captured anything, then the backreference always succeeds.</p>
@@ -28308,10 +28308,10 @@ Date.parse(x.toLocaleString())
       <!-- es6num="21.2.2.12" -->
       <emu-clause id="sec-characterclassescape">
         <h1>CharacterClassEscape</h1>
-        <p>The production <emu-grammar>CharacterClassEscape :: `d`</emu-grammar> evaluates by returning the ten-element set of characters containing the characters `0` through `9` inclusive.</p>
-        <p>The production <emu-grammar>CharacterClassEscape :: `D`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `d`</emu-grammar> .</p>
-        <p>The production <emu-grammar>CharacterClassEscape :: `s`</emu-grammar> evaluates by returning the set of characters containing the characters that are on the right-hand side of the |WhiteSpace| or |LineTerminator| productions.</p>
-        <p>The production <emu-grammar>CharacterClassEscape :: `S`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `s`</emu-grammar> .</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `d`</emu-grammar> evaluates by returning the ten-element set of characters containing the characters `0` through `9` inclusive and the Boolean *false*.</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `D`</emu-grammar> evaluates by returning the set returned by <emu-grammar>CharacterClassEscape :: `d`</emu-grammar> and the Boolean *true*.</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `s`</emu-grammar> evaluates by returning the set of characters containing the characters that are on the right-hand side of the |WhiteSpace| or |LineTerminator| productions and the Boolean *false*.</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `S`</emu-grammar> evaluates by returning the set returned by <emu-grammar>CharacterClassEscape :: `s`</emu-grammar> and the Boolean *true*.</p>
         <p>The production <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> evaluates by returning the set of characters containing the sixty-three characters:</p>
         <figure>
           <table class="lightweight-table">
@@ -28544,7 +28544,8 @@ Date.parse(x.toLocaleString())
             </tbody>
           </table>
         </figure>
-        <p>The production <emu-grammar>CharacterClassEscape :: `W`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> .</p>
+        <p>and the Boolean *false*.</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `W`</emu-grammar> evaluates by returning the set returned by <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> and the Boolean *true*.</p>
       </emu-clause>

       <!-- es6num="21.2.2.13" -->

@hashseed

hashseed commented Apr 5, 2016

This still does not address the issue with Class Ranges [0]. Part of evaluating a Class Range is Class Escape [1], in which, for the production ClassEscape :: CharacterClassEscape, we "return the CharSet that is the result of evaluating CharacterClassEscape". This CharSet may be used to create a union with another CharSet. The Boolean return value of the new CharacterClassEscape is now lost.

So with this change, /[\W\d]/ would be equivalent to /[\w\D]/, since the Boolean return value is unused. If we are to use the Boolean value, we need a way to create a union for tuples of (CharSet, Boolean).

[0] https://tc39.github.io/ecma262/#sec-classranges
[1] https://tc39.github.io/ecma262/#sec-classescape
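
The class-range concern can be sketched with a small model. Everything below is hypothetical illustration (the `characterClassEscape` table, `classEscape`, and `classUnion` are made-up names, not spec operations): it assumes CharacterClassEscape returns a (CharSet, Boolean) pair as proposed, while ClassEscape keeps only the CharSet before the union is built.

```javascript
// Hypothetical model of the proposal: each escape yields a CharSet plus
// an invert Boolean.
const DIGITS = new Set("0123456789");
const WORD = new Set(
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
);

const characterClassEscape = {
  d: { charSet: DIGITS, invert: false },
  D: { charSet: DIGITS, invert: true },
  w: { charSet: WORD, invert: false },
  W: { charSet: WORD, invert: true },
};

// ClassEscape :: CharacterClassEscape, as currently worded, returns only
// the CharSet, so the Boolean is dropped here.
const classEscape = (name) => characterClassEscape[name].charSet;

// A character class is modeled as the union of its members' CharSets.
const classUnion = (...names) =>
  new Set(names.flatMap((n) => [...classEscape(n)]));

const sameSet = (a, b) => a.size === b.size && [...a].every((x) => b.has(x));

const notWordOrDigit = classUnion("W", "d"); // models /[\W\d]/
const wordOrNotDigit = classUnion("w", "D"); // models /[\w\D]/

console.log(sameSet(notWordOrDigit, wordOrNotDigit)); // true
console.log(notWordOrDigit.has("!")); // false, although /[\W\d]/ should match "!"
```

Both classes collapse to the plain \w set in this model, which is the equivalence described above. Avoiding it would need a union defined over (CharSet, Boolean) pairs, or the complement computed before the union.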

@msaboff
Contributor Author

msaboff commented Apr 5, 2016

I do not have a good answer for the issue raised by @hashseed of CharacterClassEscapes being included in a CharacterClass.

Ignoring the processing of the 'k' and 's' characters that this proposal was trying to address, the currently specified behavior of \w and \W for unicode RegExps is not generally useful for other reasons. For example, they do not properly recognize other Unicode word characters such as accented characters.

Therefore I withdraw this request.
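
The point about accented characters can be checked directly. The sketch below is only an illustration with made-up helper names (`isWordChar`, `isLetter`). It compares \w against the `\p{L}` Unicode property escape, which was standardized later (ES2018) and is not part of this pull request, so it assumes an engine that supports property escapes.

```javascript
// \w under the unicode flag still covers only [A-Za-z0-9_].
const isWordChar = (ch) => /^\w$/u.test(ch);

// \p{L} matches any Unicode letter (needs an ES2018-capable engine).
const isLetter = (ch) => /^\p{L}$/u.test(ch);

// Precomposed accented letters, written as escapes to avoid
// normalization surprises in the source file.
const samples = ["e", "\u00E9", "\u00F1", "\u00FC"]; // e, e-acute, n-tilde, u-diaeresis

for (const ch of samples) {
  console.log(ch, isWordChar(ch), isLetter(ch));
}
```

Only the plain "e" counts as a word character here, while all four samples are letters according to `\p{L}`.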

@msaboff msaboff closed this Apr 5, 2016
@littledan
Member

Alright; we are all looking forward to your Unicode properties proposal, @goyaKim !

@msaboff
Contributor Author

msaboff commented Apr 6, 2016

I created a new pull request (525) that addresses this issue in another way.
