Add a Utf8CharDecoder type for incrementally decoding UTF-8 byte-by-byte #55

Open
@jwodder

Description

#[derive(Clone, Debug, Eq, PartialEq)]
struct Utf8CharDecoder(...);

impl Utf8CharDecoder {
    fn new() -> Self { todo!() }
    fn feed(&mut self, byte: u8) -> Option<Result<char, SomeError>> { todo!() }
    fn finish(self) -> Result<(), SomeError> { todo!() }
    fn reset(&mut self) { todo!() }
    // Something for decomposing into the inner partial bytes
    // Something for testing whether there are currently partial bytes stored
    // Something for querying how many continuation bytes are needed to complete the current character
}
  • Also add a lossy variant

  • Error conditions that the error type must cover:

    • Codepoint is greater than 0x10FFFF
    • Codepoint is a surrogate (U+D800 through U+DFFF)
    • Non-canonical (overlong) encoding of codepoint (e.g., 0b1100_0000 0b1000_0000 for NUL, U+0000)
    • Partial UTF-8 sequence followed by non-continuation byte (i.e., ASCII char or new start byte)
    • Continuation byte encountered without preceding matching start byte
  • On an error, reset the decoder to the initial state?
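
To make the error conditions above concrete, here is a rough sketch of how `feed()` could classify each byte. It is only one possible shape, not a proposed final design. Everything not in the signature sketch is an assumption: the `Utf8DecodeError` name and its variants, the `needed()` accessor, and the `acc`/`needed`/`min` fields. It also adds two cases the list does not mention (bytes 0xF8 through 0xFF, and `finish()` being called mid-sequence). For the open question about resetting, it simply resets on every error and drops the offending byte. Range errors are only reported once the last continuation byte arrives, which is later than a stricter decoder would report them.

```rust
/// Hypothetical error type: the first five variants mirror the list above.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
enum Utf8DecodeError {
    TooLarge,                  // codepoint > 0x10FFFF
    Surrogate,                 // codepoint in U+D800..=U+DFFF
    Overlong,                  // non-canonical encoding
    UnexpectedNonContinuation, // partial sequence cut short
    UnexpectedContinuation,    // continuation byte with no start byte
    InvalidByte,               // 0xF8..=0xFF (assumption: not in the list above)
    Incomplete,                // finish() with bytes pending (assumption)
}

#[derive(Clone, Debug, Default, Eq, PartialEq)]
struct Utf8CharDecoder {
    acc: u32,   // codepoint bits accumulated so far
    needed: u8, // continuation bytes still required
    min: u32,   // smallest codepoint that legitimately needs this length
}

impl Utf8CharDecoder {
    fn new() -> Self {
        Self::default()
    }

    fn reset(&mut self) {
        *self = Self::default();
    }

    /// How many continuation bytes are needed to complete the current character.
    fn needed(&self) -> u8 {
        self.needed
    }

    fn feed(&mut self, byte: u8) -> Option<Result<char, Utf8DecodeError>> {
        use Utf8DecodeError::*;
        if self.needed > 0 {
            if byte & 0xC0 != 0x80 {
                // Assumed policy: reset and drop the offending byte.
                self.reset();
                return Some(Err(UnexpectedNonContinuation));
            }
            self.acc = (self.acc << 6) | u32::from(byte & 0x3F);
            self.needed -= 1;
            if self.needed > 0 {
                return None;
            }
            let (cp, min) = (self.acc, self.min);
            self.reset();
            return Some(if cp < min {
                Err(Overlong)
            } else if cp > 0x10FFFF {
                Err(TooLarge)
            } else if (0xD800..=0xDFFF).contains(&cp) {
                Err(Surrogate)
            } else {
                Ok(char::from_u32(cp).expect("range checked above"))
            });
        }
        // Not inside a sequence: classify the byte by its high bits.
        let (acc, needed, min) = match byte {
            0x00..=0x7F => return Some(Ok(char::from(byte))),
            0x80..=0xBF => return Some(Err(UnexpectedContinuation)),
            0xC0..=0xDF => (u32::from(byte & 0x1F), 1, 0x80),
            0xE0..=0xEF => (u32::from(byte & 0x0F), 2, 0x800),
            0xF0..=0xF7 => (u32::from(byte & 0x07), 3, 0x1_0000),
            0xF8..=0xFF => return Some(Err(InvalidByte)),
        };
        *self = Self { acc, needed, min };
        None
    }

    fn finish(self) -> Result<(), Utf8DecodeError> {
        if self.needed == 0 {
            Ok(())
        } else {
            Err(Utf8DecodeError::Incomplete)
        }
    }
}
```

With this sketch, feeding 0xC3 returns `None` and feeding 0xA9 afterwards returns `Some(Ok('é'))`, while the 0b1100_0000 0b1000_0000 example from the list comes back as `Overlong` on the second byte. Dropping the byte that interrupts a partial sequence is the simplest policy; re-processing it as the start of a new character would be the main alternative, and a lossy variant would probably want that.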

Metadata

Labels: enhancement (New feature or request therefor)
