Commit 42230ef

thoughts
1 parent 8cce1c4 commit 42230ef

1 file changed

+25
-43
lines changed

src/libunicode/grapheme_line_segmenter.md

+25-43
@@ -1,28 +1,34 @@
11

22
# Processing UTF-8 byte sequences into grapheme clusters
33

4-
## Motivation
5-
6-
// 👪👪👪
4+
## The Objective
5+
6+
👪👪👪
7+
8+
- Process a contiguous sequence of UTF-8 encoded text into a sequence of grapheme clusters.
9+
- Stop processing on one of the conditions:
10+
- end of input stream is reached
11+
- a control character (such as newline or escape character) has been found
12+
- the maximum number of narrow-width columns (the page width) has been consumed, where a wide character counts as two narrow columns
13+
- Allow resuming processing text when we previously stopped in the middle of a grapheme cluster.
14+
- The algorithm must be as resource efficient as possible:
15+
- do not require any dynamic memory allocations during text processing
16+
- reduce instruction branching as much as possible
17+
- utilize SIMD to improve throughput performance
18+
- Invalid codepoints are treated as East Asian Width Narrow (1 column).
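
The three stop conditions and the column budget from the list above can be sketched as follows. This is a minimal illustration and not libunicode's API: the names `StopReason`, `ScanResult`, `columnsOf`, `isControl`, and `scan` are hypothetical, the input is assumed to be already decoded into codepoints, and `columnsOf` is a tiny placeholder for a real East Asian Width lookup. Grapheme clustering is deliberately ignored here, so every codepoint is measured on its own.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical names for illustration only.
enum class StopReason { EndOfInput, ControlCharacter, PageWidthReached };

struct ScanResult {
    std::size_t consumed;  // number of codepoints consumed
    std::size_t columns;   // number of terminal columns they occupy
    StopReason reason;     // why scanning stopped
};

// Placeholder width lookup (assumption): only a few well-known wide ranges.
// Everything else, including invalid codepoints, counts as Narrow (1 column).
constexpr unsigned columnsOf(char32_t cp) noexcept {
    bool const wide = (cp >= 0x1100 && cp <= 0x115F)     // Hangul Jamo
                   || (cp >= 0x4E00 && cp <= 0x9FFF)     // CJK Unified Ideographs
                   || (cp >= 0xAC00 && cp <= 0xD7A3)     // Hangul Syllables
                   || (cp >= 0x1F300 && cp <= 0x1F64F);  // a block of common emoji
    return wide ? 2 : 1;
}

// C0 controls, DEL, and C1 controls end a run of text.
constexpr bool isControl(char32_t cp) noexcept {
    return cp < 0x20 || cp == 0x7F || (cp >= 0x80 && cp < 0xA0);
}

// Consumes codepoints until one of the three stop conditions is met.
// No allocation happens; the function only walks the given view.
constexpr ScanResult scan(std::u32string_view text, std::size_t maxColumns) noexcept {
    std::size_t i = 0;
    std::size_t columns = 0;
    for (; i < text.size(); ++i) {
        if (isControl(text[i]))
            return {i, columns, StopReason::ControlCharacter};
        unsigned const width = columnsOf(text[i]);
        if (columns + width > maxColumns)  // a wide character must fit entirely
            return {i, columns, StopReason::PageWidthReached};
        columns += width;
    }
    return {i, columns, StopReason::EndOfInput};
}
```

Note that a wide character that would cross the page margin is not consumed at all, so the caller can resume with it on the next line.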
719

8-
We want to be able to pure scan text in the terminal, in order to be able to know what to
9-
print to the terminal's screen, but we must stop at the right page margin, as well as, when
10-
a control character (like newline or escape character) was found.
11-
The input is a consecutive memory region containing terminal output to be interpreted.
12-
Text is encoded in UTF-8.
20+
## Consequences
1321

14-
The ultimate goal is to scan text as fast as possible and stop scanning at one of the conditions:
22+
Do not report at the end of a codepoint, because the following codepoint may still extend
23+
the current grapheme cluster. Report only at the end of a complete grapheme cluster.
1524

16-
1. end of UTF-8 byte stream is reached
17-
2. a non-text character has been found (e.g. a control character)
18-
3. page width has been reached
25+
## Implementation
1926

20-
Scanning US-ASCII text is easy, trivial in fact. And US-ASCII can be even scanned using SIMD instructions,
21-
increasing scanning performance dramatically.
27+
Scanning US-ASCII can be easily implemented using SIMD, increasing scanning performance dramatically.
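
To give a feel for why the US-ASCII fast path is cheap, here is a portable sketch that tests eight bytes per step using plain 64-bit integer arithmetic (often called SWAR) instead of real SIMD intrinsics. It is an assumption-laden illustration and not the implementation in libunicode: the function name `countPlainAscii` is hypothetical, and a production version would more likely use SSE2, AVX2, or NEON to test 16 or 32 bytes at once.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// A byte is "plain" if it is printable US-ASCII: neither a control
// character (< 0x20 or 0x7F) nor part of a multi-byte UTF-8 sequence (>= 0x80).
inline bool isPlainAscii(unsigned char b) noexcept {
    return b >= 0x20 && b < 0x7F;
}

// Returns the length of the leading run of plain US-ASCII bytes.
inline std::size_t countPlainAscii(std::string_view text) noexcept {
    constexpr std::uint64_t ones  = 0x0101010101010101ULL;
    constexpr std::uint64_t highs = 0x8080808080808080ULL;

    std::size_t i = 0;
    // Fast path: examine 8 bytes per iteration.
    while (i + 8 <= text.size()) {
        std::uint64_t w;
        std::memcpy(&w, text.data() + i, 8);  // safe unaligned load

        std::uint64_t const nonAscii = w & highs;                        // any byte >= 0x80
        std::uint64_t const below20  = (w - ones * 0x20) & ~w & highs;   // any byte <  0x20
        std::uint64_t const x        = w ^ (ones * 0x7F);
        std::uint64_t const isDel    = (x - ones) & ~x & highs;          // any byte == 0x7F

        if ((nonAscii | below20 | isDel) != 0)
            break;  // something interesting is in this block; locate it below
        i += 8;
    }
    // Slow path: finish the tail (or locate the exact stop byte) one byte at a time.
    while (i < text.size() && isPlainAscii(static_cast<unsigned char>(text[i])))
        ++i;
    return i;
}
```

The word-level tests only answer whether a block contains a stop byte at all. The exact position is then found by the scalar loop, which keeps the sketch correct even though the bit tricks are imprecise about which byte triggered them.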
2228

2329
Scanning non-US-ASCII text, that is, multi-byte Unicode codepoints, is considerably more complex, because a single visible character may span several bytes and even several codepoints.
2430

25-
In order to satisfy point number 3 - stop scanning at the page width - we must take into account
31+
In order to reliably stop scanning at the page width, we must take into account
2632
that the character we see on the screen is not necessarily just a single byte,
2733
nor even a single Unicode codepoint, but rather a sequence of codepoints.
2834
This is what we call a **grapheme cluster**. A grapheme cluster is a single user-perceived character,
@@ -31,36 +37,12 @@ that can be one or more Unicode codepoints.
3137
We therefore must be able to determine the boundary where one grapheme cluster ends and the next one begins.
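
As a rough illustration of boundary detection, the sketch below decides between two adjacent codepoints whether a cluster boundary lies between them. It is heavily simplified and should not be read as the Unicode segmentation algorithm: it only approximates the "do not break before Extend or ZWJ" rule and a relaxed form of the emoji ZWJ sequence rule, and `isExtend` covers just three example ranges instead of the full `Grapheme_Cluster_Break=Extend` property. All names are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

constexpr char32_t ZeroWidthJoiner = 0x200D;

// Crude stand-in (assumption) for Grapheme_Cluster_Break=Extend.
constexpr bool isExtend(char32_t cp) noexcept {
    return (cp >= 0x0300 && cp <= 0x036F)     // combining diacritical marks
        || (cp >= 0xFE00 && cp <= 0xFE0F)     // variation selectors
        || (cp >= 0x1F3FB && cp <= 0x1F3FF);  // emoji skin tone modifiers
}

// Returns true if a grapheme cluster ends between prev and next.
constexpr bool isBoundary(char32_t prev, char32_t next) noexcept {
    if (isExtend(next) || next == ZeroWidthJoiner)
        return false;  // next still belongs to the current cluster
    if (prev == ZeroWidthJoiner)
        return false;  // simplified: the real rule also checks for pictographs
    return true;
}

// Counts clusters by counting boundaries; a cluster is only "complete"
// once a boundary (or the end of input) has been seen after it.
constexpr std::size_t countClusters(std::u32string_view text) noexcept {
    if (text.empty())
        return 0;
    std::size_t clusters = 1;
    for (std::size_t i = 1; i < text.size(); ++i)
        if (isBoundary(text[i - 1], text[i]))
            ++clusters;
    return clusters;
}
```

With these rules, a man, woman, girl sequence joined by zero width joiners (U+1F468 U+200D U+1F469 U+200D U+1F467) counts as one cluster, while three consecutive U+1F46A family emoji count as three.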
3238

3339
Because scanning US-ASCII text can be implemented using SIMD but complex Unicode cannot, we split both
34-
tasks into their own sub tasks, and then alter between the two in order to scan the sum of all Unicode.
35-
36-
In this article, we'll be focusing on scanning for complex unicode.
37-
38-
To make things even more complex, we also must be able to suspend and resume scanning at any arbitrary point
39-
in time, because we are not guaranteed to always have all bytes available. They may come in later calls.
40-
41-
## Objective
40+
tasks into their own subtasks, and then alternate between the two in order to scan arbitrary Unicode text.
4241

43-
Scan a sequence of UTF-8 bytes into grapheme clusters,
44-
emitting events for each grapheme cluster and their east asian widths,
45-
for up to a given amount of east asian widths (sum of each cluster's width),
46-
terminating also early on control characters, allowing to suspend and resume
47-
at any arbitrary point in the sequence of input bytes.
48-
49-
## Requirements
50-
51-
- The underlying input sequence to process at once is a consecutive sequence of bytes
52-
- East asian widths are mapped to terminal columns (Narrow=1, Wide=2)
53-
- Input is a consecutive sequence of bytes and the maximum number of total widths to process at most
54-
- Output is the number of widths being processed that fit into the input's maximum number of total widths
55-
- Invalid codepoints are treated with east asian width Narrow (1 column)
56-
- Processing up to a given amount of total widths
57-
- Processing can interrupt and resume at any time (like in a finite state machine)
58-
59-
## Consequences
42+
In this article, we'll be focusing on scanning for complex Unicode.
6043

61-
Do not report at the end of a codepoint,
62-
because maybe the following codepoint
63-
may extend the current grapheme cluster
44+
We also must be able to suspend and resume scanning text at any arbitrary point
45+
in time, because we are not guaranteed to always have all bytes available in a single call.
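
One way to make scanning resumable is to keep the decoder's progress in a small state object that outlives a single call. The following sketch shows this for the UTF-8 decoding step only, under stated assumptions: `Utf8DecoderState` and `feed` are hypothetical names, and the decoder is deliberately lenient (it does not reject overlong encodings or surrogates, and it swallows the byte that interrupts a broken sequence). A complete solution would carry the grapheme cluster state across calls in the same way.

```cpp
#include <optional>

// Everything the decoder must remember between two calls.
struct Utf8DecoderState {
    char32_t partial = 0;   // bits of the codepoint collected so far
    unsigned pending = 0;   // continuation bytes still expected
};

constexpr char32_t ReplacementCharacter = 0xFFFD;

// Feeds a single byte. Returns a codepoint once one is complete,
// or std::nullopt if more bytes are needed (possibly in a later call).
inline std::optional<char32_t> feed(Utf8DecoderState& state, unsigned char byte) noexcept {
    if (state.pending != 0) {
        if ((byte & 0xC0) == 0x80) {  // valid continuation byte
            state.partial = static_cast<char32_t>((state.partial << 6) | (byte & 0x3Fu));
            if (--state.pending == 0)
                return state.partial;
            return std::nullopt;
        }
        // Broken sequence: report it and start over (simplification).
        state.pending = 0;
        return ReplacementCharacter;
    }
    if (byte < 0x80)
        return static_cast<char32_t>(byte);  // US-ASCII
    if ((byte & 0xE0) == 0xC0) { state.partial = byte & 0x1Fu; state.pending = 1; return std::nullopt; }
    if ((byte & 0xF0) == 0xE0) { state.partial = byte & 0x0Fu; state.pending = 2; return std::nullopt; }
    if ((byte & 0xF8) == 0xF0) { state.partial = byte & 0x07u; state.pending = 3; return std::nullopt; }
    return ReplacementCharacter;  // stray continuation byte or invalid lead byte
}
```

For example, the family emoji U+1F46A is encoded as the bytes `F0 9F 91 AA`. If one call ends after `F0 9F`, the state holds two pending bytes, and feeding `91 AA` in the next call completes the codepoint without re-reading anything.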
6446

6547
## Example Processing: Family Emoji
6648
