thoughts

christianparpart · christianparpart · commit 42230efac7ad · 2024-02-24T13:01:36.000+01:00
diff --git a/src/libunicode/grapheme_line_segmenter.md b/src/libunicode/grapheme_line_segmenter.md
@@ -1,28 +1,34 @@
 
 # Processing UTF-8 byte sequences into grapheme clusters
 
-## Motivation
-
-// 👪👪👪
+## The Objective
+
+👪👪👪
+
+- Process a contiguous sequence of UTF-8 encoded text into a sequence of grapheme clusters.
+- Stop processing when one of the following conditions is met:
+  - the end of the input stream is reached
+  - a control character (such as a newline or escape character) is found
+  - the maximum number of columns (the page width, counted in narrow-width cells) has been consumed, where a wide character counts as two narrow characters
+- Allow resuming text processing after previously stopping in the middle of a grapheme cluster.
+- The algorithm must be as resource efficient as possible:
+  - do not require any dynamic memory allocations during text processing
+  - reduce instruction branching as much as possible
+  - utilize SIMD to improve throughput performance
+- Invalid codepoints are treated as East Asian Width Narrow (1 column).
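
The page-width stop condition can be sketched as simple column accounting. This is only an illustration under assumptions: `FitResult` and `fitClusters` are hypothetical names, not libunicode API, and the per-cluster widths (1 for narrow, 2 for wide) are assumed to have been determined already.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: how many clusters fit and how many columns they use.
struct FitResult
{
    std::size_t clusters;
    std::size_t columns;
};

// Consumes clusters from widths[0..count) while they still fit into maxColumns.
// Each width is assumed to be 1 (narrow) or 2 (wide).
inline FitResult fitClusters(std::uint8_t const* widths, std::size_t count, std::size_t maxColumns) noexcept
{
    assert(widths != nullptr || count == 0);
    FitResult result { 0, 0 };
    while (result.clusters < count && result.columns + widths[result.clusters] <= maxColumns)
    {
        result.columns += widths[result.clusters];
        ++result.clusters;
    }
    return result;
}
```

Note that a wide cluster is rejected as a whole when only one column remains, so the consumed column count can end up below the page width.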
 
-We want to be able to pure scan text in the terminal, in order to be able to know what to
-print to the terminal's screen, but we must stop at the right page margin, as well as, when
-a control character (like newline or escape character) was found.
-The input is a consecutive memory region containing terminal output to be interpreted.
-Text is encoded in UTF-8.
+## Consequences
 
-The ultimate goal is to scan text as fast as possible and stop scanning at one of the conditions:
+Do not report at the end of a codepoint, because the following codepoint may still extend
+the current grapheme cluster. Report only at the end of a complete grapheme cluster.
 
-1. end of UTF-8 byte stream is reached
-2. a non-text character has been found (e.g. a control character)
-3. page width has been reached
+## Implementation
 
-Scanning US-ASCII text is easy, trivial in fact. And US-ASCII can be even scanned using SIMD instructions,
-increasing scanning performance dramatically.
+Scanning US-ASCII text is easy to implement with SIMD, which increases scanning throughput dramatically.
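
As a rough illustration of the ASCII fast path, here is a minimal portable sketch. It uses SWAR (testing 8 bytes at once in a 64-bit integer) as a stand-in for real SIMD intrinsics, and `scanPrintableAscii` is a hypothetical name, not libunicode's API. It assumes one ASCII byte occupies one column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns the number of leading bytes that are printable US-ASCII (0x20..0x7E),
// looking at no more than maxColumns bytes.
inline std::size_t scanPrintableAscii(char const* data, std::size_t size, std::size_t maxColumns) noexcept
{
    assert(data != nullptr || size == 0);
    constexpr std::uint64_t Ones = 0x0101010101010101ULL;
    constexpr std::uint64_t Highs = 0x8080808080808080ULL;

    std::size_t const limit = size < maxColumns ? size : maxColumns;
    std::size_t i = 0;

    // Fast path: accept 8 bytes at once if none of them ends the text run.
    while (limit - i >= 8)
    {
        std::uint64_t word = 0;
        std::memcpy(&word, data + i, sizeof(word));
        std::uint64_t const nonAscii = word & Highs;                     // any byte >= 0x80
        std::uint64_t const control = (word - Ones * 0x20) & ~word & Highs; // any byte < 0x20
        std::uint64_t const xored = word ^ (Ones * 0x7F);
        std::uint64_t const del = (xored - Ones) & ~xored & Highs;       // any byte == 0x7F
        if ((nonAscii | control | del) != 0)
            break;
        i += 8;
    }

    // Slow path: find the exact stop position byte by byte.
    while (i < limit)
    {
        auto const ch = static_cast<unsigned char>(data[i]);
        if (ch < 0x20 || ch > 0x7E)
            break;
        ++i;
    }
    return i;
}
```

The word-level tests only answer whether a chunk contains a stopping byte at all; the byte-wise loop then locates it, which keeps the fast path branch-light.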
 
 Scanning non-US-ASCII text, i.e. complex Unicode codepoints, is considerably more involved.
 
-In order to satisfy point number 3 - stop scanning at the page width - we must take into account
+To reliably stop scanning at the page width, we must take into account
 that the character we see on the screen is not necessarily just a single byte,
 nor even a single UTF-32 codepoint, but rather a sequence of UTF-32 codepoints.
 This is what we call a **grapheme cluster**. A grapheme cluster is a user-perceived single grapheme entity,
@@ -31,36 +37,12 @@ that can be one or more Unicode codepoints.
 We therefore must be able to determine the boundary where one grapheme cluster ends and the next one begins.
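
To make the byte/codepoint/cluster distinction concrete, the following sketch counts the codepoints in a family emoji built as a ZWJ sequence (MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY). The names `countCodepoints` and `FamilyEmoji` are illustrative assumptions, and the counter assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Counts codepoints in well-formed UTF-8 by counting every byte
// that is not a continuation byte (0b10xxxxxx).
inline std::size_t countCodepoints(std::string_view utf8) noexcept
{
    std::size_t count = 0;
    for (char const c: utf8)
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++count;
    return count;
}

// U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
inline constexpr std::string_view FamilyEmoji =
    "\xF0\x9F\x91\xA8" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA9" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA7" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA6";
```

This sequence occupies 25 bytes and 7 codepoints, yet it is a single grapheme cluster, so a scanner counting bytes or codepoints would badly overestimate the consumed columns.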
 
 Because scanning US-ASCII text can be implemented using SIMD but complex Unicode cannot, we split both
-tasks into their own sub tasks, and then alter between the two in order to scan the sum of all Unicode.
-
-In this article, we'll befocusing on scanning for complex unicode.
-
-To make things even more complex, we also must be able to suspend and resume scanning at any arbitrary point
-in time, because we are not guaranteed to always have all bytes available. They may come in later calls.
-
-## Objective
+tasks into their own subtasks, and then alternate between the two in order to scan arbitrary Unicode text.
 
-Scan a sequence of UTF-8 bytes into grapheme clusters,
-emitting events for each grapheme cluster and their east asian widths,
-for up to a given amount of east asian widths (sum of each cluster's width),
-terminating also early on control characters, allowing to suspend and resume
-at any arbitrary point in the sequence of input bytes.
-
-## Requirements
-
-- The underlying input sequence to process at once is a consecutive sequence of bytes
-- East asian widths are mapped to terminal columns (Narrow=1, Wide=2)
-- Input is a consecutive sequence of bytes and the maximum number of total widths to process at most
-- Output is the number of widths being processed that fit into the input's maximum number of total widths
-- Invalid codepoints are treated with east asian width Narrow (1 column)
-- Processing up to a given amount of total widths
-- Processing can interrupt and resume at any time (like in a finite state machine)
-
-## Consequences
+In this article, we'll be focusing on scanning complex Unicode.
 
-Do not report at the end of a codepoint,
-because maybe the following codepoint
-may extend the current grapheme cluster
+We must also be able to suspend and resume scanning text at any arbitrary point,
+because we are not guaranteed to have all bytes available in a single call.
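
One way to support suspend and resume at the UTF-8 level is to keep the partially decoded codepoint in a small state object that survives between calls. The sketch below is an assumption-laden illustration, not the library's implementation: `Utf8DecoderState` and `feed` are hypothetical names, and overlong encodings and surrogates are deliberately not rejected to keep it short.

```cpp
#include <cstddef>
#include <string_view>

// Decoder state carried across calls; no dynamic allocation involved.
struct Utf8DecoderState
{
    char32_t partial = 0; // payload bits collected so far
    unsigned pending = 0; // continuation bytes still expected
};

// Feeds one chunk of bytes and calls sink(codepoint) for every completed codepoint.
// A codepoint split across chunks is completed by a later call.
template <typename Sink>
void feed(Utf8DecoderState& state, std::string_view chunk, Sink&& sink)
{
    for (char const c: chunk)
    {
        auto const byte = static_cast<unsigned char>(c);
        if (state.pending != 0)
        {
            if ((byte & 0xC0) == 0x80)
            {
                state.partial = (state.partial << 6) | (byte & 0x3F);
                if (--state.pending == 0)
                    sink(state.partial);
                continue;
            }
            // Truncated sequence: report U+FFFD, then treat this byte as a new start.
            state.pending = 0;
            sink(char32_t { 0xFFFD });
        }
        if (byte < 0x80)
            sink(char32_t { byte });
        else if ((byte & 0xE0) == 0xC0) { state.partial = byte & 0x1F; state.pending = 1; }
        else if ((byte & 0xF0) == 0xE0) { state.partial = byte & 0x0F; state.pending = 2; }
        else if ((byte & 0xF8) == 0xF0) { state.partial = byte & 0x07; state.pending = 3; }
        else
            sink(char32_t { 0xFFFD }); // stray continuation byte or invalid lead byte
    }
}
```

A full scanner would need to preserve grapheme cluster state in the same way, since a chunk may also end between two codepoints of one cluster.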
 
 ## Example Processing: Family Emoji