Skip to content

Non-ASCII class and author names break SourceFileArray>>#getPreambleFrom:at: #2395

Closed
@svenvc

Description

@svenvc

People are reporting that they get strange encoding errors when filing out code while using a non-ASCII class or author name.

Here is some initial analysis:

This is one manifestation of a problem that has been present for quite a while.

I'll start by describing what I did, what went well and where/how this fails, some generic points, and two conceptual solutions (that need further verification).

Like you, I created a new subclass:

Object subclass: #ClasseFrançaise
	instanceVariableNames: ''
	classVariableNames: ''
	package: '_UnpackagedPackage'

With comment:

I am ClasseFrançaise.

Try:

	ClasseFrançaise new élève.
	ClasseFrançaise new euro.

And two methods (in the 'test' protocol):

élève
	^ 'élève'

euro
	^ '€'

I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 like ç).
Like you said, the system can cope with such class and method names and seems to function fine.

Looking at the .changes file, the correct source code was appended:

----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: 339848!

Object subclass: #ClasseFrançaise
       instanceVariableNames: ''
       classVariableNames: ''
       package: '_UnpackagedPackage'!
!ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 0!
I am ClasseFrançaise.!
!ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:26'!
élève
       ^ 'élève'! !
!ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: 33898360!
I am ClasseFrançaise.

Try:

       ClasseFrançaise new élève.
       ClasseFrançaise new euro.
!
!ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 12:27'!
euro
	^ '€'! !

Doing a file out (or otherwise saving the source code) fails. The reason is an incorrect manipulation of this source file while looking for what is called the method preamble, in SourcFileArray>>#getPreambleFrom:at: position

An programmatic way to invoke the same error is by doing

(ClasseFrançaise>>#élève) timeStamp.
(ClasseFrançaise>>#élève) author.

Both fail with the same error.

The source code of methods is (currently) stored in a .sources or .changes file. CompiledMethods know their source pointer, an offset in one of these files. Right before the place where the source starts is a preamble that contains some meta information (including the author and timestamp). To access that preamble, the source code pointer is moved backwards to the beginning of the preamble (which begins and ends with a !).

The current approach fails in the presence of non-ASCII characters. More specifically because of a mixup between the concept of byte position and character position when using UTF-8, a variable length encoding (both the .changes and the .sources are UTF-8 encoded).

For example, consider

'à partir de 10 €' size. "16"
'à partir de 10 €' utf8Encoded size. "19"

So although the string contains 16 characters, it is encoded as 19 bytes, à using 2 bytes and € using 3 bytes. In general, moving backwards or forwards in UTF-8 encoded bytes cannot be done without understanding UTF-8 itself.

ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will signal an error when forced to operate in between encoded characters, which is what happens here.

It is thus not possible to move to arbitrary bytes positions and assume/hope to always arrive on the correct character boundaries and it is also wrong to take the difference between two byte positions as the count of characters present (since their encoding is of variable length).

SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets away with it in 99.99% of all cases since very few people name their classes like that).

There are two solutions: operate mostly on the byte level or operate correctly on the character level. Here are two conceptual solutions (you must execute either solution 1 or 2, not both), with two different inputs.

src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
euro
	^ ''€''! !'.

"startPosition := 83"

str := ZnCharacterReadStream on: (src utf8Encoded readStream).
str position: 83. "at start of euro, the methods source string"
str upToEnd.

str position: (83 - 3). "before ! before euro"

"find the previous ! before position"
position := str position.
binary := str wrappedStream.
encoder := str encoder.

"solution 1"
[ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
position.
encoder decodeBytes: (binary next: 80 - position).

"solution 2"
count := 0.
[ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ 
	encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
str position.
str next: count.

Same code, different input (and starting position):

src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 1/27/2019 12:27''!
euro
	^ ''€''! !'.

"startPosition := 87"

str := ZnCharacterReadStream on: (src utf8Encoded readStream).
str position: 87. "at start of euro, the methods source string"
str upToEnd.

str position: (87 - 3). "before ! before euro"

"find the previous ! before position"
position := str position.
binary := str wrappedStream.
encoder := str encoder.

"solution 1"
[ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] whileTrue: [ position := position - 1 ].
position.
encoder decodeBytes: (binary next: 80 - position).

"solution 2"
count := 0.
[ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ 
	encoder backOnStream: binary; backOnStream: binary. count := count + 1 ].
str position.
str next: count.

Solution 1 (at the byte level) works because $! (ASCII code 33) is also encoded as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 byte UTF-8 encoded character. The final byte segment must then be given to the proper encoder to turn it into characters.

Solution 2 (at the character level) works because it essentially counts characters while moving backwards correctly (it moves back 2 steps each time because #next moves 1 step forward, while #peek would not help).

Note that in Solution 1 the moving position is held in an external variable, while in Solution 2 it is held inside the stream.

Implementing either solution requires more internal access of the streams (#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be moved to SourceFile, IMHO.

Obviously this is a sensitive/critical area to touch.

We will also need a regression unit test.

On a higher design level, I think that CompiledMethod>>#timeStamp and CompiledMethod>>#author should be cached at the instance level (a unix timestamp and a symbol would cost very little).

Sven

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions