# Processing UTF-8 byte sequences into grapheme clusters

- ## Motivation
-
- // 👪👪👪
+ ## The Objective
+
+ 👪👪👪
+
+ - Process a contiguous sequence of UTF-8 encoded text into a sequence of grapheme clusters.
+ - Stop processing when one of the following conditions is met:
+   - the end of the input stream is reached
+   - a control character (such as a newline or escape character) is found
+   - the maximum number of narrow-width columns (the page width) has been consumed, where a wide character counts as two narrow columns
+ - Allow resuming text processing after previously stopping in the middle of a grapheme cluster.
+ - The algorithm must be as resource-efficient as possible:
+   - do not require any dynamic memory allocations during text processing
+   - reduce instruction branching as much as possible
+   - utilize SIMD to improve throughput
+ - Invalid codepoints are treated as East Asian Width Narrow (1 column).
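The stop conditions listed above can be illustrated with a minimal sketch. It is restricted to US-ASCII, where every printable character occupies exactly one narrow column. `StopReason`, `ScanResult`, and `scan_ascii` are hypothetical names chosen for illustration, and the extra `NonAscii` reason is an assumption of this sketch that marks where the complex Unicode path would take over.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; they are not taken from any library.
enum class StopReason
{
    EndOfInput,       // all input bytes were consumed
    ControlCharacter, // a control character such as newline or ESC was found
    PageWidthReached, // the column budget is exhausted
    NonAscii          // sketch-only: a byte that needs the complex Unicode path
};

struct ScanResult
{
    std::size_t bytesConsumed; // bytes accepted before stopping
    std::size_t columnsUsed;   // narrow columns consumed (1 per ASCII character)
    StopReason reason;
};

// Scans printable US-ASCII without allocating any memory.
inline ScanResult scan_ascii(const char* data, std::size_t size, std::size_t maxColumns) noexcept
{
    assert(data != nullptr || size == 0);
    std::size_t i = 0;
    for (; i < size; ++i)
    {
        auto const byte = static_cast<unsigned char>(data[i]);
        if (byte < 0x20 || byte == 0x7F)
            return ScanResult { i, i, StopReason::ControlCharacter };
        if (byte >= 0x80)
            return ScanResult { i, i, StopReason::NonAscii };
        if (i == maxColumns)
            return ScanResult { i, i, StopReason::PageWidthReached };
    }
    return ScanResult { i, i, StopReason::EndOfInput };
}
```

With a page width of 4, `scan_ascii("abcdef", 6, 4)` stops after four bytes and reports `PageWidthReached`, so the caller can continue from byte 4 on the next line.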

- We want to be able to pure scan text in the terminal, in order to be able to know what to
- print to the terminal's screen, but we must stop at the right page margin, as well as, when
- a control character (like newline or escape character) was found.
- The input is a consecutive memory region containing terminal output to be interpreted.
- Text is encoded in UTF-8.
+ ## Consequences

- The ultimate goal is to scan text as fast as possible and stop scanning at one of the conditions:
+ Do not report at the end of a codepoint, because the following codepoint may extend
+ the current grapheme cluster. Report only at the end of a complete grapheme cluster.
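A minimal sketch of why reporting has to wait for the next codepoint. The rule set in `extends_cluster` is a heavily simplified assumption, not the full Unicode grapheme cluster algorithm, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Heavily simplified stand-in for the Unicode grapheme cluster rules.
// Returns true if `next` continues the cluster that `prev` belongs to.
inline bool extends_cluster(char32_t prev, char32_t next) noexcept
{
    if (prev == 0x200D) // assumption: anything following ZERO WIDTH JOINER joins
        return true;
    return next == 0x200D                        // ZERO WIDTH JOINER
        || (next >= 0x0300 && next <= 0x036F)    // combining diacritical marks
        || (next >= 0xFE00 && next <= 0xFE0F)    // variation selectors
        || (next >= 0x1F3FB && next <= 0x1F3FF); // emoji skin tone modifiers
}

// Counts clusters in a complete input. A cluster is reported only once the
// following codepoint is known not to extend it, or the input has ended.
inline std::size_t count_clusters(const char32_t* codepoints, std::size_t count) noexcept
{
    assert(codepoints != nullptr || count == 0);
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        bool const isLast = (i + 1 == count);
        if (isLast || !extends_cluster(codepoints[i], codepoints[i + 1]))
            ++clusters; // the cluster ending at index i is complete
    }
    return clusters;
}
```

Under these simplified rules, the five codepoints U+1F468 U+200D U+1F469 U+200D U+1F467 form a single cluster, which only becomes reportable after the last codepoint has been seen.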

- 1. end of UTF-8 byte stream is reached
- 2. a non-text character has been found (e.g. a control character)
- 3. page width has been reached
+ ## Implementation

- Scanning US-ASCII text is easy, trivial in fact. And US-ASCII can be even scanned using SIMD instructions,
- increasing scanning performance dramatically.
+ Scanning US-ASCII text can easily be implemented using SIMD, which increases scanning performance dramatically.
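To give a rough idea of the approach, the sketch below checks eight bytes per step. It uses portable SWAR (SIMD within a register) on a 64-bit integer rather than real vector intrinsics, so it only stands in for the technique; a production version would presumably use platform SIMD instructions. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint64_t Ones = 0x0101010101010101ULL; // 0x01 in every byte
constexpr std::uint64_t High = 0x8080808080808080ULL; // 0x80 in every byte

// True if any of the 8 bytes in `word` is zero.
inline bool has_zero_byte(std::uint64_t word) noexcept
{
    return ((word - Ones) & ~word & High) != 0;
}

// True if all 8 bytes are printable US-ASCII (0x20..0x7E).
inline bool all_printable_ascii(std::uint64_t word) noexcept
{
    bool const nonAscii = (word & High) != 0;                        // any byte >= 0x80
    bool const control = ((word - Ones * 0x20) & ~word & High) != 0; // any byte < 0x20
    bool const del = has_zero_byte(word ^ (Ones * 0x7F));            // any byte == 0x7F
    return !(nonAscii || control || del);
}

// Returns the number of leading printable US-ASCII bytes.
inline std::size_t count_printable_ascii(const char* data, std::size_t size) noexcept
{
    assert(data != nullptr || size == 0);
    std::size_t i = 0;
    while (size - i >= 8) // fast path: 8 bytes per iteration
    {
        std::uint64_t word;
        std::memcpy(&word, data + i, sizeof(word));
        if (!all_printable_ascii(word))
            break;
        i += 8;
    }
    for (; i < size; ++i) // slow path: locate the exact stop position
    {
        auto const byte = static_cast<unsigned char>(data[i]);
        if (byte < 0x20 || byte >= 0x7F)
            break;
    }
    return i;
}
```

The fast path only answers whether a block is clean. The byte-wise tail then finds the exact position of the control character or non-ASCII byte at which scanning has to stop.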

Scanning non-US-ASCII text, i.e. complex Unicode codepoints, is far more involved.

- In order to satisfy point number 3 - stop scanning at the page width - we must take into account
+ In order to reliably stop scanning at the page width, we must take into account
that the character we see on the screen is not necessarily just a single byte,
nor even a single UTF-32 codepoint, but possibly a sequence of UTF-32 codepoints.
This is what we call a **grapheme cluster**. A grapheme cluster is a user-perceived single grapheme entity,
@@ -31,36 +37,12 @@ that can be one or more Unicode codepoints.
We therefore must be able to determine the boundary where one grapheme cluster ends and the next one begins.

Because scanning US-ASCII text can be implemented using SIMD but complex Unicode cannot, we split both
- tasks into their own sub tasks, and then alter between the two in order to scan the sum of all Unicode.
-
- In this article, we'll befocusing on scanning for complex unicode.
-
- To make things even more complex, we also must be able to suspend and resume scanning at any arbitrary point
- in time, because we are not guaranteed to always have all bytes available. They may come in later calls.
-
- ## Objective
+ tasks into their own subtasks, and then alternate between the two in order to scan all Unicode text.

- Scan a sequence of UTF-8 bytes into grapheme clusters,
- emitting events for each grapheme cluster and their east asian widths,
- for up to a given amount of east asian widths (sum of each cluster's width),
- terminating also early on control characters, allowing to suspend and resume
- at any arbitrary point in the sequence of input bytes.
-
- ## Requirements
-
- - The underlying input sequence to process at once is a consecutive sequence of bytes
- - East asian widths are mapped to terminal columns (Narrow=1, Wide=2)
- - Input is a consecutive sequence of bytes and the maximum number of total widths to process at most
- - Output is the number of widths being processed that fit into the input's maximum number of total widths
- - Invalid codepoints are treated with east asian width Narrow (1 column)
- - Processing up to a given amount of total widths
- - Processing can interrupt and resume at any time (like in a finite state machine)
-
- ## Consequences
+ In this article, we'll be focusing on scanning complex Unicode.

- Do not report at the end of a codepoint,
- because maybe the following codepoint
- may extend the current grapheme cluster
+ We must also be able to suspend and resume scanning at any arbitrary point
+ in the input, because not all bytes are guaranteed to be available in a single call.
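One way to make scanning resumable is to keep the decoder state outside the call, so that a later call continues where the previous one stopped. The following is a minimal sketch under that assumption. `Utf8DecoderState`, `DecodeStatus`, and `feed` are hypothetical names, and the sketch deliberately omits overlong-encoding and surrogate checks.

```cpp
#include <cassert>
#include <cstdint>

// State that survives between calls; plain data, no allocation.
struct Utf8DecoderState
{
    std::uint32_t codepoint = 0; // bits collected so far
    unsigned pending = 0;        // continuation bytes still expected
};

enum class DecodeStatus { Incomplete, Complete, Invalid };

// Feeds one byte. On Complete, `out` holds the decoded codepoint.
inline DecodeStatus feed(Utf8DecoderState& st, std::uint8_t byte, std::uint32_t& out) noexcept
{
    assert(st.pending <= 3);
    if (st.pending == 0)
    {
        if (byte < 0x80) { out = byte; return DecodeStatus::Complete; }
        if ((byte & 0xE0) == 0xC0) { st.codepoint = byte & 0x1Fu; st.pending = 1; return DecodeStatus::Incomplete; }
        if ((byte & 0xF0) == 0xE0) { st.codepoint = byte & 0x0Fu; st.pending = 2; return DecodeStatus::Incomplete; }
        if ((byte & 0xF8) == 0xF0) { st.codepoint = byte & 0x07u; st.pending = 3; return DecodeStatus::Incomplete; }
        return DecodeStatus::Invalid; // stray continuation byte or invalid lead byte
    }
    if ((byte & 0xC0) != 0x80)
    {
        st = Utf8DecoderState(); // reset; the caller decides how to treat the byte
        return DecodeStatus::Invalid;
    }
    st.codepoint = (st.codepoint << 6) | (byte & 0x3Fu);
    if (--st.pending != 0)
        return DecodeStatus::Incomplete;
    out = st.codepoint;
    return DecodeStatus::Complete;
}
```

If the first call delivers only the bytes `F0 9F` of 👪 (U+1F46A), the state records two pending continuation bytes. Once a later call delivers `91 AA`, the codepoint completes, as if the input had never been split.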

## Example Processing: Family Emoji