Who else is exploring exposing sequences in regular expressions? #12
Comments
I haven’t seen any similar proposals. AFAIK this is entirely new ground. I’ve made the Unicode Consortium aware of this proposal. If the proposal eventually makes it to Stage 4, it would certainly be nice to have the Unicode Consortium’s blessing by means of updating UTS18 with an optional recommendation to implement this functionality. All that is out of scope for this particular repository/proposal, though.
There's a limited form in Perl: https://ideone.com/iO8kBg
#!/usr/bin/perl
use utf8;
binmode STDOUT, ":utf8";
# \N{...} with a named sequence stands for U+0101 followed by U+0300.
my $re = qr/\N{LATIN SMALL LETTER A WITH MACRON AND GRAVE}/u;
my @cases = ("\x{0101}", "\x{0300}", "\x{0101}\x{0300}");
for my $str (@cases) {
    if ( $str =~ $re ) {
        print ("matches $re: $str\n");
    } else {
        print ("does not match $re: $str\n");
    }
}
Are there any sequence properties which aren't pictograph-related? I couldn't help noticing every property under "Proposed Solution" was related to Emoji. I'm curious if there are any other potential properties which are or might one day be used.
Could this syntax be extended to cover Unicode decomposition? Or is that something for another proposal? Something like this:
const before = /[1₁➊①]/; // Huge list of anything that decomposes to U+0031
const after = /1/d; // “Decompose” flag
after.test("¹"); // ⇒ true
after.test("1"); // ⇒ true
Basically, it'd replicate how most contemporary browsers use “smart” matching when searching for "1" on a page (which matches Unicode sequences which have an equivalent decomposition mapping).
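For illustration, something close to the behaviour sketched above can be approximated without a new flag by normalizing the input before matching. This is only a minimal sketch: it assumes compatibility decomposition (NFKD) is the mapping intended, and `matchesDecomposed` is a hypothetical helper name, not part of this or any other proposal.

```javascript
// Hypothetical helper: test a pattern against the NFKD-normalized input.
// The pattern should not use the `g` or `y` flag, since those make
// RegExp.prototype.test stateful via lastIndex.
function matchesDecomposed(pattern, input) {
  return pattern.test(input.normalize("NFKD"));
}

const one = /^1$/;

console.log(matchesDecomposed(one, "1")); // true (plain digit)
console.log(matchesDecomposed(one, "¹")); // true (U+00B9 decomposes to U+0031)
console.log(matchesDecomposed(one, "₁")); // true (U+2081 decomposes to U+0031)
console.log(matchesDecomposed(one, "①")); // true (U+2460 decomposes to U+0031)
console.log(matchesDecomposed(one, "2")); // false
```

Normalizing the input changes string offsets, so match indices from this approach refer to the normalized string rather than the original one.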
@Alhadis That sounds like a separate proposal to me. @nathanhammond To your original question, I've been discussing this proposal with Unicode folks at Google, and more recently have created a formal proposal at the Unicode level, which has now been officially submitted. Mark Davis will present the proposal during the January UTC meeting at Google MTV from January 22–25. If the proposal gets accepted, we'll know what the official term is for sequence properties. One of the things the proposal addresses is the idea of this functionality not being JavaScript-only. If the proposal ends up getting accepted, it would explicitly mention sequence properties in UTS#18, enabling us to proceed with this proposal knowing we have the Unicode Consortium's blessing, and enabling other languages to consider adopting this functionality.
@mathiasbynens I've just read the proposal and I really appreciate you doing the upstream legwork in Unicode land. If those changes land upstream I would be in favor of advancing this proposal. In my head I'm still generally uneasy with the idea of heading down a path for standardizing this for JS without the PCRE folks. However, I also feel that TC39 (you) have been paying close attention to this space recently and JS is therefore well-suited to be the first one to deliver.
Update: Unicode has agreed to formalize “properties of strings”: https://www.unicode.org/reports/tr18/#domain_of_properties
The recommendation for which properties to support in regular expressions has been updated to include the 7 binary properties of strings: https://www.unicode.org/reports/tr18/#Full_Properties → “Properties marked with * are properties of strings, not just single code points.”
These are the 7 emoji properties of strings defined in UTS #51 and its data files.
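To make the distinction between code point properties and properties of strings concrete, here is a hedged sketch. It assumes an engine that implements the RegExp `v` flag (standardized later, in ES2024), which accepts `\p{RGI_Emoji}`; the regex is built inside a `try` so the snippet still runs on older engines. `buildRgiEmojiRegExp` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns a RegExp for the RGI_Emoji property of strings,
// or null if the engine does not support the `v` flag.
function buildRgiEmojiRegExp() {
  try {
    return new RegExp("^\\p{RGI_Emoji}$", "v");
  } catch {
    return null;
  }
}

// The flag of Japan is a sequence of two regional indicator code points.
const japanFlag = "\u{1F1EF}\u{1F1F5}";

// A code point property matches exactly one code point, so it cannot
// match the two-code-point flag sequence as a whole.
const codePointOnly = /^\p{Emoji}$/u;
console.log(codePointOnly.test(japanFlag)); // false

// A property of strings can match the whole sequence.
const rgiEmoji = buildRgiEmojiRegExp();
if (rgiEmoji !== null) {
  console.log(rgiEmoji.test(japanFlag)); // true
  console.log(rgiEmoji.test("JP")); // false
}
```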
Those kinds of things are possible, and are discussed in UTS 18: https://www.unicode.org/reports/tr18/#Wildcard_Properties They are actually typically fancy ways of creating sets of characters, not of strings. You can see that in the UnicodeSet demo: But other functions like this could yield sets of strings. And as Richard pointed out, all of these are out of scope for the current proposal.
The browsers I am aware of implement “ctrl+F” in-page search via collation (e.g., using ICU's class StringSearch), which is a very different kind of algorithm (UCA=UTS 10, and CLDR extensions).
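As a rough illustration of how collation-based comparison differs from regex matching, the sketch below uses the standard `Intl.Collator` API with `sensitivity: "base"`, which ignores accent and case differences. Note the assumptions: `Intl.Collator` only compares whole strings, so this is not the substring search that ICU's StringSearch performs, and `looselyEqual` is a hypothetical helper name.

```javascript
// Collator that treats strings as equal when only accents or case differ.
const collator = new Intl.Collator("en", { sensitivity: "base" });

// Hypothetical helper: true when the collator considers the strings equal.
function looselyEqual(a, b) {
  return collator.compare(a, b) === 0;
}

console.log(looselyEqual("a", "á")); // true (accent difference ignored)
console.log(looselyEqual("a", "A")); // true (case difference ignored)
console.log(looselyEqual("a", "b")); // false (different base letters)
```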
Is TC39 the only group exploring directly exposing emoji sequences in regular expressions?