Skip to content

[testharness.js] Require valid UTF-8 in test title #16253

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jugglinmike
Copy link
Contributor

Chromedriver rejects "Execute Script" payloads containing unpaired surrogates.
I was getting ready to file a bug report against that project, but I think it
may be a reasonable behavior (since WebDriver is UTF-8, and UTF-8 says those
are off-limits). That's why I'm proposing that testharness.js reports a harness
error.

That's not technically necessary, though. We could silently sanitize the titles
and report an "OK" status. I don't really like changing what the authors wrote,
though.

When tests are executed in automation, the results may be transported
over channels that strictly enforce the UTF-8 restriction on unpaired
surrogates. Avoid encoding errors in those contexts by renaming the
tests to use human-readable representations of the code points.
@jugglinmike
Copy link
Contributor Author

Here's evidence from the latest set of data on wpt.fyi:

https://wpt.fyi/results/webstorage/storage_setitem.html?sha=3bbb559&label=master&product=chrome%5Bexperimental%5D&product=edge&product=firefox%5Bexperimental%5D&product=safari%5Bexperimental%5D

It looks like Edge has trouble with the sub-tests, though it doesn't crash outright.

@jugglinmike jugglinmike changed the title Webstorage unpaired surrogates [testharness.js] Require valid UTF-8 in test title Apr 3, 2019
@gsnedders
Copy link
Member

@jugglinmike UTF-8 isn't relevant here; U+D800 is a perfectly valid UTF-8 sequence (as all ASCII characters are!). JSON specs are a bit unclear about whether lone surrogates are valid, though most people argue they are AFAICT.

I think at some point we concluded that #9415 was caused by lone surrogates (and maybe U+0000 too?), but myself and @jgraham rather viewed JSON as allowing both and therefore as somebody else's problem.

That said, I think mine and @jgraham's opinions have changed, and escaping lone surrogates seems sane (IIRC, @jgraham said at least one browser already did in some layer?). We should probably escape all noncharacters (U+0000, U+D800-U+DFFFF, U+xFFFE and U+xFFFF for x in [0, 0x10]), not just lone surrogates, though, given U+0000 also causes problems in places.

@jugglinmike
Copy link
Contributor Author

@jugglinmike UTF-8 isn't relevant here; U+D800 is a perfectly valid UTF-8 sequence (as all ASCII characters are!).

I know there are precise terms that we could use to discuss this, but I'm afraid of misusing them and confusing the discussion. I'll speak in terms of JavaScript strings since I'm on firmer ground there.

I agree that the JavaScript string "U+D800" can be expressed in UTF-8 using characters in the ASCII set. However, my reading of the spec (section 3.9) is that the JavaScript string "\ud800" can not be expressed in UTF-8:

  • Because surrogate code points are not Unicode scalar values, any UTF-8 byte
    sequence that would otherwise map to code points U+D800..U+DFFF is
    ill-formed.

So even though Chromedriver reports a crash in that case, I decided not to file a bug. Am I misinterpreting things?

@jgraham said at least one browser already did in some layer?

Yeah, some part of the equivalent stack in Firefox silently transforms the string. That's also apparent on the wpt.fyi page referenced above (Firefox is passing a test named 'localStorage["U+d800"]'.)

This patch mimics that transformation in testharness.js, expanding each offending character with a sequence of 6 ASCII characters describing it (e.g. "hello\ud800world" to "helloU+d800world").

I think at some point we concluded that #9415 was caused by lone surrogates (and maybe U+0000 too?), but myself and @jgraham rather viewed JSON as allowing both and therefore as somebody else's problem.

You may also be thinking of gh-14245, which is a patch to replace the null byte due to a bug in EdgeDriver.

My viewpoint on silent transformation hasn't changed, though. That kind of behavior may be convenient for some authors, but it might also confuse others. This admittedly involves assumptions for subjective questions like, "How convenient?", "How confusing?" and "How often?"

@jgraham
Copy link
Contributor

jgraham commented Apr 4, 2019

I agree with the part of this patch that replaces unpaired surrogates with the escape sequence in test titles; that's something that would also fail in Firefox if we ran through GeckoDriver and hoists behaviour that's currently in the wptreport formatter to the test source. The same should be done for the message string.

I disagree with the part that makes the tests error. Where we have these sequences in tests it's usually because someone is testing against a range of codepoints and auto-generating the titles and message from input data. Making this an error forces test authors to recreate exactly the same code for escaping on the test level. We should simply document that lone surrogates (and null) are escaped in the harness.

@jugglinmike
Copy link
Contributor Author

While writing this patch, I found format_value, an intentionally-exposed but undocumented function for creating "human-readable" strings from various values. Would you two be any more open to requiring explicit sanitization if we provided the necessary functionality there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants