[editorial] Rephrase encoding note to make the implications clearer. #804

wants to merge 2 commits into
base: main
5 changes: 3 additions & 2 deletions url.bs
@@ -2038,8 +2038,9 @@ and <a>code points</a> in the range U+00A0 to U+10FFFD, inclusive, excluding <a>
<!-- IRI also excludes the ranges U+E000 to U+F8FF, U+FFF0 to U+FFFD, and U+E0000 to U+E09FF, all
inclusive. We don't to align with HTML. -->

-<p class=note>Code points greater than U+007F DELETE will be converted to
-<a lt="percent-encoded byte">percent-encoded bytes</a> by the <a>URL parser</a>.
+<p class=note>For historical reasons, rather than storing codepoints and [=byte/percent-encoding=]
+to ASCII for serialization, URLs instead store their value as ASCII internally, eagerly converting
+code points greater than U+007F DELETE to [=percent-encoded bytes=] during [=URL parser|parsing=].
Member

Apologies for not responding to this more quickly, but I think I never ended up merging it because I'm not sure this is correct. I suspect one could convert at serialization time instead. It's just not how the specification is written.

Contributor Author

Sure, the spec could be written another way (potentially), but it's currently not written that way, and the specifics of how the data is encoded/represented at this point in the spec are important, so I know that the URL structure only includes ASCII code points. If we changed to a "convert at serialization" model, that would also be important to note, so it was clear that the URL structure includes non-ASCII code points.

As I said, this note actively confused me: the spec talks about "URL code points" including non-ASCII code points, but URLs themselves do not contain these code points, and that wasn't clear to me from how the note was written.
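
To make the behavior under discussion concrete, here is a minimal Python sketch of the conversion the note describes. It is not the URL Standard's algorithm (which applies per-component percent-encode sets that also cover some ASCII characters); the helper name `percent_encode_non_ascii` is hypothetical and only handles code points above U+007F.

```python
def percent_encode_non_ascii(text: str) -> str:
    """Percent-encode the UTF-8 bytes of every code point above U+007F.

    Illustrative only: ASCII characters are passed through untouched,
    which is a simplification of what a real URL parser does.
    """
    out = []
    for ch in text:
        if ord(ch) > 0x7F:
            # One "%XX" triplet per UTF-8 byte of the code point.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


# The written form contains U+00E9, but the stored form is ASCII-only.
stored_path = percent_encode_non_ascii("/caf\u00e9")
print(stored_path)            # /caf%C3%A9
print(stored_path.isascii())  # True
```

The input may contain non-ASCII URL code points, while the value that ends up stored consists only of ASCII code points, which is the distinction the reworded note is trying to draw.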

Member

Hmm. 1) It's not for "historical reasons". 2) This section is really about writing URLs, it isn't really about their internal representation at all. That's section 4.1 and that already makes it clear most components are ASCII strings.

Contributor Author

The "for historical reasons" was me assuming and editorializing. (It seemed like a weird thing to do! It's not usually good practice to encode into the byte format immediately; usually you hold it in the good data model and only encode at the edges, when you have to hit the wire.) If that's not true, and it really is just a quirk of the model, I can rephrase that bit.

And this section is about writing URLs, sure, but there was already a note about how the code points you write will be encoded. I was just rewriting the note for (imo) better clarity. If there's a better place to make this note, I can move it there, but this section does seem relatively germane to what the note is saying (since URLs can "contain" non-ASCII code points, but the actual internal representation is ASCII-only).
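
For readers following the thread, the two models being contrasted (encode while parsing versus encode at serialization) can be sketched roughly as below. The class names `EagerPath` and `LazyPath` are hypothetical, and Python's `urllib.parse.quote` is used as a stand-in for the URL Standard's percent-encoding, so details such as which ASCII characters get encoded will differ from the spec.

```python
from urllib.parse import quote


class EagerPath:
    """Model matching the spec as written: encode during parsing, store ASCII."""

    def __init__(self, raw: str) -> None:
        self.stored = quote(raw, safe="/")  # UTF-8 percent-encoding up front

    def serialize(self) -> str:
        return self.stored


class LazyPath:
    """Alternative model: store code points, encode only when serializing."""

    def __init__(self, raw: str) -> None:
        self.stored = raw

    def serialize(self) -> str:
        return quote(self.stored, safe="/")


eager = EagerPath("/caf\u00e9")
lazy = LazyPath("/caf\u00e9")

# Both models serialize identically; they differ only in what is stored.
print(eager.serialize() == lazy.serialize())  # True
print(eager.stored.isascii(), lazy.stored.isascii())  # True False
```

Under these assumptions the serialized output is the same either way, so the choice only becomes visible to someone inspecting the stored components, which is why the thread treats the internal representation as worth stating explicitly.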


<p class=note>In HTML, when the document encoding is a legacy encoding, code points in the
<a>URL-query string</a> that are higher than U+007F DELETE will be converted to