Commit 33b049b

svenvc authored and committed
added ZnLossyUTF8Encoder
1 parent 62cdf5e commit 33b049b

16 files changed, +105 -2 lines changed

@@ -0,0 +1,15 @@
+I am ZnLossyUTF8Encoder.
+I am a ZnUTF8Encoder.
+
+I behave like my superclass but will not signal errors when I see illegal UTF-8 encoded input,
+instead I will output a Unicode Replacement Character (U+FFFD) for each error.
+
+In contrast to my superclass I can read any random byte sequence, decoding both legal and illegal UTF-8 sequences.
+
+Due to my stream based design and usage as well as my stateless implementation,
+I will output multiple replacement characters when multiple illegal sequences occur.
+
+My convenience method #decodeBytesSingleReplacement: shows how to decode bytes so that
+only a single replacement character stands for any amount of illegal encoding between legal encodings.
+
+Part of Zinc HTTP Components.
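
As a rough illustration of the behavior the class comment describes, the playground snippet below decodes the same malformed bytes with the strict superclass and with the lossy encoder. The byte values mirror those used in testLossyUTF8. The `on: Error do:` handler is an assumption about how a caller might trap the strict encoder's failure; this commit does not prescribe it.

```smalltalk
| bytes strict lossy |
"65 = $A, 160 = a stray continuation byte (illegal as a leading byte), 66 = $B"
bytes := #[65 160 66].

"Strict decoding: the superclass signals an error on the illegal byte.
 Catching the generic Error class is an assumption made for this sketch."
strict := [ ZnUTF8Encoder new decodeBytes: bytes ]
	on: Error
	do: [ :exception | exception return: nil ].

"Lossy decoding: the illegal byte becomes U+FFFD and decoding continues."
lossy := ZnLossyUTF8Encoder new decodeBytes: bytes.

"strict is nil; lossy is a 3-character string: $A, U+FFFD, $B"
```

The lossy result keeps the legal characters on both sides of the bad byte, which is the point of the class: arbitrary input can always be turned into a String.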
@@ -0,0 +1,5 @@
+accessing
+handlesEncoding: string
+	"Return true when my instances handle the encoding described by string"
+
+	^ (self canonicalEncodingIdentifier: string) = 'utf8lossy'
@@ -0,0 +1,3 @@
+accessing
+knownEncodingIdentifiers
+	^ #( utf8lossy )
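
For illustration, these two methods are what allow the encoder to be found by name. The sketch below assumes `handlesEncoding:` is defined on the class side (the file paths are not shown in this view) and relies on `asZnCharacterEncoder`, which the test added in this commit also uses.

```smalltalk
| encoder |
"Class-side query; assumes handlesEncoding: lives on the class side"
ZnLossyUTF8Encoder handlesEncoding: 'utf8lossy'. "true"

"Lookup by identifier, as done in testLossyUTF8"
encoder := #utf8lossy asZnCharacterEncoder.
encoder identifier. "#utf8lossy"
```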
@@ -0,0 +1,22 @@
+convenience
+decodeBytesSingleReplacement: bytes
+	"Decode bytes and return the resulting string.
+	This variant of #decodeBytes: will only ever use
+	a single replacement character for each run of consecutive illegal UTF-8 sequences"
+
+	| byteStream replaced replacement char |
+	byteStream := bytes readStream.
+	replaced := false.
+	replacement := self replacementCodePoint asCharacter.
+	^ String streamContents: [ :stream |
+		[ byteStream atEnd ] whileFalse: [
+			char := self nextFromStream: byteStream.
+			char = replacement
+				ifTrue: [
+					replaced
+						ifFalse: [
+							replaced := true.
+							stream nextPut: replacement ] ]
+				ifFalse: [
+					replaced := false.
+					stream nextPut: char ] ] ]
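
Because the method compares every decoded character against the replacement character, it cannot tell a U+FFFD produced by an error from a U+FFFD that was legitimately encoded in the input (bytes EF BF BD). The playground sketch below contrasts the two decoding methods on the bytes from testLossyUTF8 and then shows this edge case. The variable names are illustrative only.

```smalltalk
| encoder illegal genuine |
encoder := ZnLossyUTF8Encoder new.

"Three stray continuation bytes between $A and $B"
illegal := #[16r41 16rA1 16rA2 16rA3 16r42].
(encoder decodeBytes: illegal) size. "5: one U+FFFD per illegal byte"
(encoder decodeBytesSingleReplacement: illegal) size. "3: the run collapses to one U+FFFD"

"Two correctly encoded U+FFFD characters (EF BF BD twice), no illegal input at all"
genuine := #[16rEF 16rBF 16rBD 16rEF 16rBF 16rBD].
(encoder decodeBytes: genuine) size. "2"
(encoder decodeBytesSingleReplacement: genuine) size. "1: the legal pair is collapsed as well"
```

Callers that need to preserve literal U+FFFD characters from the input may therefore prefer plain `decodeBytes:`.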
@@ -0,0 +1,3 @@
+error handling
+errorIllegalContinuationByte
+	^ self replacementCodePoint

@@ -0,0 +1,3 @@
+error handling
+errorIllegalLeadingByte
+	^ self replacementCodePoint

@@ -0,0 +1,3 @@
+error handling
+errorIncomplete
+	^ self replacementCodePoint

@@ -0,0 +1,3 @@
+error handling
+errorOutsideRange
+	^ self replacementCodePoint

@@ -0,0 +1,3 @@
+error handling
+errorOverlong
+	^ self replacementCodePoint
@@ -0,0 +1,3 @@
+accessing
+identifier
+	^ #utf8lossy
@@ -0,0 +1,5 @@
+accessing
+replacementCodePoint
+	"Return the code point for the Unicode Replacement Character U+FFFD"
+
+	^ 16rFFFD
@@ -0,0 +1,11 @@
+{
+	"commentStamp" : "<historical>",
+	"super" : "ZnUTF8Encoder",
+	"category" : "Zinc-Character-Encoding-Core",
+	"classinstvars" : [ ],
+	"pools" : [ ],
+	"classvars" : [ ],
+	"instvars" : [ ],
+	"name" : "ZnLossyUTF8Encoder",
+	"type" : "normal"
+}
@@ -1 +1 @@
-self packageOrganizer ensurePackage: #'Zinc-Character-Encoding-Core' withTags: #()!
+SystemOrganization addCategory: #'Zinc-Character-Encoding-Core'!
@@ -0,0 +1,18 @@
+testing
+testLossyUTF8
+	| encoder replacement |
+	encoder := ZnLossyUTF8Encoder new.
+	self assert: #utf8lossy asZnCharacterEncoder equals: encoder.
+	replacement := encoder replacementCodePoint asCharacter.
+	self
+		assert: (#[65 160 66] decodeWith: encoder)
+		equals: ({ $A. replacement . $B } as: String).
+	self
+		assert: (#[16rE1 16rA0 16rC0] decodeWith: encoder)
+		equals: replacement asString.
+	self
+		assert: (encoder decodeBytes: #[16r41 16rA1 16rA2 16rA3 16r42])
+		equals: ({ $A. replacement . replacement . replacement . $B } as: String).
+	self
+		assert: (encoder decodeBytesSingleReplacement: #[16r41 16rA1 16rA2 16rA3 16r42])
+		equals: ({ $A. replacement . $B } as: String).
@@ -0,0 +1,6 @@
+testing
+testLossyUTF8Random
+	| bytes string |
+	bytes := ((1 to: 10000) collect: [ :_ | 256 atRandom - 1 ]) asByteArray.
+	string := bytes decodeWith: ZnLossyUTF8Encoder new.
+	self assert: string isString
@@ -1 +1 @@
-self packageOrganizer ensurePackage: #'Zinc-Character-Encoding-Tests' withTags: #()!
+SystemOrganization addCategory: #'Zinc-Character-Encoding-Tests'!
