RegExp Processing of CharacterClassEscape for \W should be the same as !\w with unicode and ignoreCase flags. #516

msaboff · 2016-04-04T18:40:42Z

The current specification of CharacterClassEscapes in Regular Expressions introduces surprising behavior of \W when both the unicode and ignoreCase flags are provided.

There are 6 CharacterClassEscapes defined:
\d is digit character and \D is not digit character
\s is space character and \S is not space character
\w is word character and \W is not word character
Furthermore, !\d matches the same as \D, !\s matches the same as \S, and !\w matches the same as \W, with the exception of /\W/ui and the characters 'k', 'K', 's' and 'S'.
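As a quick illustration of the complement relationship in the flagless case, the following sketch compares each pair of escapes over every BMP code unit. The helper name `countMismatches` is hypothetical and is not part of the proposal; the check deliberately avoids the `u` and `i` flags, because the behavior with those flags depends on which edition of the specification an engine implements.

```javascript
// Each pair is [lowercase escape, uppercase escape], with no flags set.
const pairs = [[/\d/, /\D/], [/\s/, /\S/], [/\w/, /\W/]];

// Count the code units for which the uppercase escape is NOT the exact
// negation of the lowercase escape (hypothetical helper for illustration).
function countMismatches(lower, upper) {
  let mismatches = 0;
  for (let code = 0; code <= 0xffff; code++) {
    const ch = String.fromCharCode(code);
    if (upper.test(ch) === lower.test(ch)) mismatches++;
  }
  return mismatches;
}

const results = pairs.map(([lower, upper]) => countMismatches(lower, upper));
console.log(results); // one mismatch count per pair; all should be 0
```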

These exceptions should be removed so that /\W/ui.test("k") is false, given that /\w/ui.test("k") is true. The same change should be made for the other three affected characters. The proposed changes involve how the CharSet and invert flag are generated for CharacterSetMatcher for CharacterClassEscapes.
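To make the mechanism concrete, here is a minimal model of the two approaches. It is a sketch under stated assumptions, not engine or spec code: `canonicalize` is a simplified stand-in for the spec's Canonicalize under the unicode and ignoreCase flags (only the U+017F and U+212A foldings and ASCII lowercasing are modelled), `universe` is a tiny hypothetical alphabet standing in for all of Unicode, and `matches`, `oldW` and `newW` are illustrative names.

```javascript
// The sixty-three characters of \w.
const WORD = new Set(
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
);

// Assumption: simplified case folding covering only what matters here.
function canonicalize(ch) {
  if (ch === "\u017F") return "s"; // LATIN SMALL LETTER LONG S
  if (ch === "\u212A") return "k"; // KELVIN SIGN
  return /^[A-Z]$/.test(ch) ? ch.toLowerCase() : ch;
}

// Assumption: a small universe instead of every Unicode code point.
const universe = [...WORD, "\u017F", "\u212A", " ", "-"];

// Model of CharacterSetMatcher(A, invert): look for a member of A whose
// canonical form equals the canonical form of the input character.
function matches(inSet, invert, ch) {
  const cc = canonicalize(ch);
  const found = universe.some((a) => inSet(a) && canonicalize(a) === cc);
  return invert ? !found : found;
}

// Current text: \W is the complement CharSet, matched with invert = false.
const oldW = (ch) => matches((a) => !WORD.has(a), false, ch);
// Proposal: \W reuses the CharSet of \w, matched with invert = true.
const newW = (ch) => matches((a) => WORD.has(a), true, ch);

for (const ch of ["k", "K", "s", "S", "a", " "]) {
  console.log(JSON.stringify(ch), oldW(ch), newW(ch));
}
```

In this model the complement set contains U+017F and U+212A, which canonicalize to 's' and 'k', so `oldW` accepts the four ASCII letters; `newW` rejects them because the inversion happens after canonicalized matching.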

The proposed changes are also made for the \D and \S CharacterClassEscapes for consistency, even though the changes do not impact what they match.

This change has been discussed in issue 512.

msaboff · 2016-04-04T20:22:48Z

Fixed the issue that @mathiasbynens found.

ljharb · 2016-04-04T20:26:32Z

spec.html

@@ -28544,7 +28544,8 @@
            </tbody>
          </table>
        </figure>
-        <p>The production <emu-grammar>CharacterClassEscape :: `W`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> .</p>
+        <p>and the Boolean *false*.</p>


this line still seems like it should be reverted?

The markup around this area looks really weird (closing </tbody> without a matching start tag etc.) but this line, at least, is intentional. It’s the continuation of the sentence on line https://github.com/tc39/ecma262/pull/516/files/1944b451b767e30575150204d0d28a8ae270a3a5#diff-3540caefa502006d8a33cb1385720803L28315.

I'm open to suggestions for a different place to add the "and the Boolean false." than after the table.

msaboff · 2016-04-04T21:53:53Z

I have no idea why the change to line 28547 is only showing the one added line. This is what is shown in my local repository for the whole change - exactly what I'm proposing and pushed:

diff --git a/spec.html b/spec.html
index 55fe781..e5131f6 100644
--- a/spec.html
+++ b/spec.html
@@ -28110,8 +28110,8 @@ Date.parse(x.toLocaleString())
         </emu-alg>
         <p>The production <emu-grammar>AtomEscape :: CharacterClassEscape</emu-grammar> evaluates as follows:</p>
         <emu-alg>
-          1. Evaluate |CharacterClassEscape| to obtain a CharSet _A_.
-          1. Call CharacterSetMatcher(_A_, *false*) and return its Matcher result.
+          1. Evaluate |CharacterClassEscape| to obtain a CharSet _A_ and a Boolean _invert_.
+          1. Call CharacterSetMatcher(_A_, _invert_) and return its Matcher result.
         </emu-alg>
         <emu-note>
           <p>An escape sequence of the form `\\` followed by a nonzero decimal number _n_ matches the result of the _n_th set of capturing parentheses (see 0). It is an error if the regular expression has fewer than _n_ capturing parentheses. If the regular expression has _n_ or more capturing parentheses but the _n_th one is *undefined* because it has not captured anything, then the backreference always succeeds.</p>
@@ -28308,10 +28308,10 @@ Date.parse(x.toLocaleString())
       <!-- es6num="21.2.2.12" -->
       <emu-clause id="sec-characterclassescape">
         <h1>CharacterClassEscape</h1>
-        <p>The production <emu-grammar>CharacterClassEscape :: `d`</emu-grammar> evaluates by returning the ten-element set of characters containing the characters `0` through `9` inclusive.</p>
-        <p>The production <emu-grammar>CharacterClassEscape :: `D`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `d`</emu-grammar> .</p>
-        <p>The production <emu-grammar>CharacterClassEscape :: `s`</emu-grammar> evaluates by returning the set of characters containing the characters that are on the right-hand side of the |WhiteSpace| or |LineTerminator| productions.</p>
-        <p>The production <emu-grammar>CharacterClassEscape :: `S`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `s`</emu-grammar> .</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `d`</emu-grammar> evaluates by returning the ten-element set of characters containing the characters `0` through `9` inclusive and the Boolean *false*.</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `D`</emu-grammar> evaluates by returning the set returned by <emu-grammar>CharacterClassEscape :: `d`</emu-grammar> and the Boolean *true*.</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `s`</emu-grammar> evaluates by returning the set of characters containing the characters that are on the right-hand side of the |WhiteSpace| or |LineTerminator| productions and the Boolean *false*.</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `S`</emu-grammar> evaluates by returning the set returned by <emu-grammar>CharacterClassEscape :: `s`</emu-grammar> and the Boolean *true*.</p>
         <p>The production <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> evaluates by returning the set of characters containing the sixty-three characters:</p>
         <figure>
           <table class="lightweight-table">
@@ -28544,7 +28544,8 @@ Date.parse(x.toLocaleString())
             </tbody>
           </table>
         </figure>
-        <p>The production <emu-grammar>CharacterClassEscape :: `W`</emu-grammar> evaluates by returning the set of all characters not included in the set returned by <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> .</p>
+        <p>and the Boolean *false*.</p>
+        <p>The production <emu-grammar>CharacterClassEscape :: `W`</emu-grammar> evaluates by returning the set returned by <emu-grammar>CharacterClassEscape :: `w`</emu-grammar> and the Boolean *true*.</p>
       </emu-clause>

       <!-- es6num="21.2.2.13" -->

hashseed · 2016-04-05T05:43:05Z

This still does not address the issue with Class Range [0]. Part of evaluating a Class Range is Class Escape [1], in which, for the production ClassEscape::CharacterClassEscape, we "return the CharSet that is the result of evaluating CharacterClassEscape". This CharSet may be used to create a union with another CharSet. The Boolean return value of the new CharacterClassEscape is now lost.

So with this change, /[\W\d]/ would be equivalent to /[\w\D]/, since the Boolean return value is unused. If we are to use the Boolean value, we need a way to create a union for tuples of (CharSet, Boolean).
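The concern above can be sketched in a few lines. This is a hypothetical model rather than spec text: `characterClassEscape` returns the proposed (CharSet, Boolean) pair, `classEscape` mimics a ClassEscape step that keeps only the CharSet, and `classUnion` and `sameSet` are illustrative helpers.

```javascript
const DIGITS = new Set("0123456789");
const WORD = new Set(
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
);

// Proposed CharacterClassEscape: a CharSet plus an invert flag.
function characterClassEscape(letter) {
  switch (letter) {
    case "d": return { charSet: DIGITS, invert: false };
    case "D": return { charSet: DIGITS, invert: true };
    case "w": return { charSet: WORD, invert: false };
    case "W": return { charSet: WORD, invert: true };
    default: throw new RangeError("unsupported escape: " + letter);
  }
}

// ClassEscape :: CharacterClassEscape as currently worded: only the CharSet
// is returned, so the invert flag is silently dropped.
function classEscape(letter) {
  return characterClassEscape(letter).charSet;
}

// Union of the CharSets of several escapes inside one character class.
function classUnion(...letters) {
  const out = new Set();
  for (const letter of letters) {
    for (const ch of classEscape(letter)) out.add(ch);
  }
  return out;
}

const sameSet = (a, b) => a.size === b.size && [...a].every((x) => b.has(x));

const classWd = classUnion("W", "d"); // models /[\W\d]/
const classwD = classUnion("w", "D"); // models /[\w\D]/
console.log(sameSet(classWd, classwD));
```

With the flag dropped, both classes collapse to the CharSet of \w, which is why a union operation over (CharSet, Boolean) tuples would be needed.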

[0] https://tc39.github.io/ecma262/#sec-classranges
[1] https://tc39.github.io/ecma262/#sec-classescape

msaboff · 2016-04-05T20:03:55Z

I do not have a good answer for the issue raised by @hashseed of CharacterClassEscapes being included in a CharacterClass.

Ignoring the processing of the 'k' and 's' characters that this proposal was trying to address, the currently specified behavior of \w and \W for unicode RegExps is not generally useful for other reasons. For example, they do not properly recognize other Unicode word characters, such as accented characters.

Therefore I withdraw this request.
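The point about accented characters can be checked directly. The sketch below assumes an engine that also supports Unicode property escapes (`\p{…}`), which were standardized after this discussion; that second test is included only for contrast and is not part of this pull request.

```javascript
const eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

// \w covers only the ASCII word characters, even with the unicode flag.
const matchesWord = /\w/u.test(eAcute);

// Assumption: the engine supports Unicode property escapes.
const matchesLetter = /\p{L}/u.test(eAcute);

console.log(matchesWord, matchesLetter); // false true
```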

littledan · 2016-04-06T20:38:49Z

Alright; we are all looking forward to your Unicode properties proposal, @goyaKim !

msaboff · 2016-04-06T22:32:02Z

I created a new pull request (525) that addresses this issue in another way.

Proposed RegExp CharacterClassEscape changes for \W/ui

1179cb8

msaboff mentioned this pull request Apr 4, 2016

RegExp processing unicode+ignoreCase of \W is not the same as !\w when matching "S" or "K" #512

Closed

Minor corrections.

1944b45

ljharb reviewed Apr 4, 2016

msaboff closed this Apr 5, 2016

msaboff mentioned this pull request Apr 6, 2016

Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B #525

Merged

mathiasbynens mentioned this pull request Apr 7, 2016

\p{…} and \P{…} mathiasbynens/es-regexp-unicode-character-class-escapes#2

Closed
