[testharness.js] Require valid UTF-8 in test title #16253

jugglinmike · 2019-04-03T22:01:41Z

Chromedriver rejects "Execute Script" payloads containing unpaired surrogates.
I was getting ready to file a bug report against that project, but I think it
may be a reasonable behavior (since WebDriver is UTF-8, and UTF-8 says those
are off-limits). That's why I'm proposing that testharness.js reports a harness
error.

That's not technically necessary. We could instead silently sanitize the titles
and report an "OK" status, but I don't really like changing what the authors
wrote.

When tests are executed in automation, the results may be transported over channels that strictly enforce the UTF-8 restriction on unpaired surrogates. Avoid encoding errors in those contexts by renaming the tests to use human-readable representations of the code points.

jugglinmike · 2019-04-03T22:07:01Z

Here's evidence from the latest set of data on wpt.fyi:

https://wpt.fyi/results/webstorage/storage_setitem.html?sha=3bbb559&label=master&product=chrome%5Bexperimental%5D&product=edge&product=firefox%5Bexperimental%5D&product=safari%5Bexperimental%5D

It looks like Edge has trouble with the sub-tests, though it doesn't crash outright.

gsnedders · 2019-04-03T23:29:20Z

@jugglinmike UTF-8 isn't relevant here; U+D800 is a perfectly valid UTF-8 sequence (as all ASCII characters are!). JSON specs are a bit unclear about whether lone surrogates are valid, though most people argue they are AFAICT.
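As a rough illustration of the JSON point, here is a minimal sketch of how a JavaScript engine treats a lone surrogate when serializing. It assumes an engine that implements the ES2019 "well-formed `JSON.stringify`" behavior; the helper name `jsonRoundTrip` is made up for this example and is not part of testharness.js.

```javascript
// Hypothetical helper: serialize a string to JSON and parse it back.
function jsonRoundTrip(str) {
    const text = JSON.stringify(str);
    return { text: text, value: JSON.parse(text) };
}

const result = jsonRoundTrip("\ud800");

// With ES2019 semantics the serialized text contains only ASCII:
// a quote, a backslash, the letter u, four hex digits, and a quote.
console.log(result.text);

// JSON.parse accepts the escape and rebuilds the lone surrogate.
console.log(result.value === "\ud800"); // true
```

So the escaped form survives a round trip inside one engine; whether every consumer further down the pipeline accepts it is a separate question.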

I think at some point we concluded that #9415 was caused by lone surrogates (and maybe U+0000 too?), but myself and @jgraham rather viewed JSON as allowing both and therefore as somebody else's problem.

That said, I think mine and @jgraham's opinions have changed, and escaping lone surrogates seems sane (IIRC, @jgraham said at least one browser already did in some layer?). We should probably escape all noncharacters (U+0000, U+D800-U+DFFF, U+xFFFE and U+xFFFF for x in [0, 0x10]), not just lone surrogates, though, given U+0000 also causes problems in places.
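The set of code points listed above could be checked with something like the following sketch. It reads the surrogate range as U+D800 to U+DFFF and follows the list exactly as written in this comment; the function names `isProblematicCodePoint` and `findProblematic` are hypothetical and do not exist in testharness.js.

```javascript
// Hypothetical predicate covering the set named in the comment:
// U+0000, the surrogate range, and U+xFFFE / U+xFFFF for each plane x.
function isProblematicCodePoint(cp) {
    if (cp === 0x0000) {
        return true;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return true;
    }
    // The low 16 bits identify the last two code points of every plane.
    const low16 = cp & 0xFFFF;
    return low16 === 0xFFFE || low16 === 0xFFFF;
}

// Iterating a string with for...of walks it by code point, so a properly
// paired surrogate shows up as one astral code point, while an unpaired
// surrogate shows up on its own inside the surrogate range.
function findProblematic(str) {
    const found = [];
    for (const ch of str) {
        const cp = ch.codePointAt(0);
        if (isProblematicCodePoint(cp)) {
            found.push(cp);
        }
    }
    return found;
}

console.log(findProblematic("a\u0000b\ud800c\uffff")); // [ 0, 55296, 65535 ]
console.log(findProblematic("\ud83d\ude00"));          // []
```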

jugglinmike · 2019-04-04T01:56:11Z

> @jugglinmike UTF-8 isn't relevant here; U+D800 is a perfectly valid UTF-8 sequence (as all ASCII characters are!).

I know there are precise terms that we could use to discuss this, but I'm afraid of misusing them and confusing the discussion. I'll speak in terms of JavaScript strings since I'm on firmer ground there.

I agree that the JavaScript string "U+D800" can be expressed in UTF-8 using characters in the ASCII set. However, my reading of the Unicode Standard (section 3.9) is that the JavaScript string "\ud800" cannot be expressed in UTF-8:

> Because surrogate code points are not Unicode scalar values, any UTF-8 byte
> sequence that would otherwise map to code points U+D800..U+DFFF is
> ill-formed.

So even though Chromedriver reports a crash in that case, I decided not to file a bug. Am I misinterpreting things?
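To make the distinction between the two strings concrete, here is a small sketch using standard JavaScript APIs (`TextEncoder` and `encodeURIComponent`). The helper names `utf8Bytes` and `canPercentEncode` are invented for this illustration, and it says nothing about what Chromedriver does internally.

```javascript
// Illustrative helpers (names are hypothetical, not part of testharness.js).

// Encode a JavaScript string to UTF-8 and return the bytes as a plain array.
function utf8Bytes(str) {
    return Array.from(new TextEncoder().encode(str));
}

// Report whether encodeURIComponent accepts the string. It throws a
// URIError when the input contains an unpaired surrogate.
function canPercentEncode(str) {
    try {
        encodeURIComponent(str);
        return true;
    } catch (e) {
        if (e instanceof URIError) {
            return false;
        }
        throw e;
    }
}

// The six ASCII characters "U+D800" encode to six bytes.
console.log(utf8Bytes("U+D800")); // [ 85, 43, 68, 56, 48, 48 ]

// The lone surrogate has no UTF-8 form; TextEncoder substitutes U+FFFD.
console.log(utf8Bytes("\ud800")); // [ 239, 191, 189 ]

console.log(canPercentEncode("\ud800")); // false
```

In both cases the original code unit is lost or rejected, which is consistent with a transport that insists on well-formed UTF-8 refusing the payload.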

> @jgraham said at least one browser already did in some layer?

Yeah, some part of the equivalent stack in Firefox silently transforms the string. That's also apparent on the wpt.fyi page referenced above (Firefox is passing a test named 'localStorage["U+d800"]'.)

This patch mimics that transformation in testharness.js, replacing each offending character with a sequence of 6 ASCII characters describing it (e.g. "hello\ud800world" becomes "helloU+d800world").
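A minimal sketch of that kind of transformation, for readers who want to see the shape of it. This is not the code from the patch: the function name `escapeLoneSurrogates` is hypothetical, and the lowercase `U+xxxx` output format is assumed from the example above.

```javascript
// Hypothetical helper: replace each unpaired surrogate with the six ASCII
// characters "U+" followed by its four lowercase hex digits.
function escapeLoneSurrogates(str) {
    let out = "";
    // for...of iterates by code point: a valid surrogate pair arrives as one
    // two-unit string, an unpaired surrogate arrives as a one-unit string.
    for (const ch of str) {
        const cp = ch.codePointAt(0);
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            out += "U+" + cp.toString(16);
        } else {
            out += ch;
        }
    }
    return out;
}

console.log(escapeLoneSurrogates("hello\ud800world")); // helloU+d800world

// A properly paired surrogate (here U+1F600) is left untouched.
console.log(escapeLoneSurrogates("a\ud83d\ude00b") === "a\ud83d\ude00b"); // true
```

Iterating by code point avoids a hand-written regular expression for "high surrogate not followed by a low surrogate", at the cost of building the result one piece at a time.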

> I think at some point we concluded that #9415 was caused by lone surrogates (and maybe U+0000 too?), but myself and @jgraham rather viewed JSON as allowing both and therefore as somebody else's problem.

You may also be thinking of gh-14245, which is a patch to replace the null byte due to a bug in EdgeDriver.

My viewpoint on silent transformation hasn't changed, though. That kind of behavior may be convenient for some authors, but it might also confuse others. This admittedly involves assumptions for subjective questions like, "How convenient?", "How confusing?" and "How often?"

jgraham · 2019-04-04T18:49:20Z

I agree with the part of this patch that replaces unpaired surrogates with the escape sequence in test titles. Those titles would also fail in Firefox if we ran through GeckoDriver, and the change hoists behaviour that's currently in the wptreport formatter to the test source. The same should be done for the message string.

I disagree with the part that makes the tests error. Where we have these sequences in tests it's usually because someone is testing against a range of codepoints and auto-generating the titles and message from input data. Making this an error forces test authors to recreate exactly the same code for escaping on the test level. We should simply document that lone surrogates (and null) are escaped in the harness.

jugglinmike · 2019-04-05T21:54:06Z

While writing this patch, I found format_value, an intentionally-exposed but undocumented function for creating "human-readable" strings from various values. Would you two be any more open to requiring explicit sanitization if we provided the necessary functionality there?

jugglinmike added 2 commits April 2, 2019 21:41

[testharness.js] Require valid UTF-8 in test title

2a20b43

wpt-pr-bot added infra testharness.js webstorage wg-webplatform labels Apr 3, 2019

wpt-pr-bot assigned zqzhang Apr 3, 2019

wpt-pr-bot requested review from gsnedders, inexorabletash, jdm, jgraham, siusin, tobie and zqzhang April 3, 2019 22:01

jugglinmike changed the title ~~Webstorage unpaired surrogates~~ [testharness.js] Require valid UTF-8 in test title Apr 3, 2019

jugglinmike mentioned this pull request Apr 5, 2019

[testharness.js] Sanitize unpaired surrogates #16280

Merged
