How should "everything after the scheme" URLs work? #385

Open
@domenic

There are several URL types that are basically of the form scheme:<some arbitrary data>. For example, data:, mailto:, javascript:, and urn:.

The question is, how should software process these URLs? I see three main models:

  1. Treat these as non-URLs: check if the string has a leading scheme:, then look at everything after that.
    • Nothing specced does this. (Although I suspect a decent amount of un-specced non-browser software might.)
    • This is probably not a good idea, if we want to call these things URLs at all. For example, it misses canonicalizations like percent-decoding and whitespace-stripping that are otherwise common to URLs.
  2. Parse the URL. Check if its scheme is the one you want. Then serialize it and strip the leading scheme. (Maybe also strip the fragment?) Now process the remaining set of code units.
  3. Parse the URL. Now, validate it according to some strict criteria, such as: no username, no password, no host, no port, maybe no query, maybe no fragment. Now, process the path, and optionally process the query or fragment, if those are allowed for your scheme.
    • Nothing specced does this, yet.
    • This might be better than (2), as it is stricter validation, and more in line with the traditional RFCs, which consider these "everything after the scheme" URLs as having paths only.
    • This model seems a bit weird in that if your <some arbitrary data> contains ?s or #s, you have to model that as allowing queries and fragments, and then processing ${path}?${query}#${fragment}. Whereas (2) just lets you process the whole string at once.
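
To make the three models concrete, here is a rough Python sketch. It is only an illustration: `urllib.parse.urlsplit` is standing in for the URL parser and it is not the URL Standard's parser, so details such as canonicalization will differ. The function names (`model1_payload`, `model2_payload`, `model3_parts`) and the `strip_fragment`, `allow_query`, and `allow_fragment` parameters are made up for this example.

```python
import re
from urllib.parse import urlsplit

# RFC 3986 scheme syntax: a letter, then letters, digits, "+", "-", or ".".
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")


def model1_payload(url: str, scheme: str) -> str:
    """Model (1): no URL parsing; return everything after the leading 'scheme:'."""
    match = SCHEME_RE.match(url)
    if match is None or match.group(1).lower() != scheme:
        raise ValueError(f"not a {scheme}: URL")
    return url[match.end():]


def model2_payload(url: str, scheme: str, strip_fragment: bool = False) -> str:
    """Model (2): parse, check the scheme, serialize, strip the leading 'scheme:'."""
    parts = urlsplit(url)  # assumption: stand-in for a real URL parser
    if parts.scheme != scheme:
        raise ValueError(f"not a {scheme}: URL")
    if strip_fragment:
        parts = parts._replace(fragment="")
    serialized = parts.geturl()
    return serialized[len(scheme) + 1:]


def model3_parts(url: str, scheme: str,
                 allow_query: bool = False,
                 allow_fragment: bool = False) -> tuple[str, str, str]:
    """Model (3): parse, validate strictly, return (path, query, fragment)."""
    parts = urlsplit(url)
    if parts.scheme != scheme:
        raise ValueError(f"not a {scheme}: URL")
    # A non-empty netloc means a username, password, host, or port is present.
    if parts.netloc:
        raise ValueError("username, password, host, and port are not allowed")
    if parts.query and not allow_query:
        raise ValueError("query is not allowed")
    if parts.fragment and not allow_fragment:
        raise ValueError("fragment is not allowed")
    return parts.path, parts.query, parts.fragment
```

Under these assumptions, `model2_payload` hands the caller one opaque string, while `model3_parts` forces the caller to decide per scheme whether a query or fragment is acceptable and to reassemble the pieces if it wants the whole payload.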

An interesting example contrasting (2) and (3) is the following: javascript://somehost/%0Aalert(1)

  • In (2), it would work, and cause an alert, because the source string //somehost/\nalert(1) is interpreted as a comment followed by an alert.
  • In (3), it would fail, since we'd validate that hosts aren't present in javascript: URLs.
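
The same example can be walked through in a few lines of Python. Again, this is only a sketch: `urlsplit` and `unquote` from `urllib.parse` are assumed stand-ins for the real URL parser and percent-decoder, and the variable names are illustrative.

```python
from urllib.parse import unquote, urlsplit

url = "javascript://somehost/%0Aalert(1)"
parts = urlsplit(url)  # assumption: approximates the URL parser

# Model (2): serialize, strip "javascript:", then percent-decode the rest.
source = unquote(parts.geturl()[len("javascript:"):])
# The decoded source has two lines: a "//" comment, then the alert call.
lines = source.split("\n")

# Model (3): a host is present, so strict validation would reject the URL.
has_host = parts.hostname is not None
```

Here `lines` comes out as the comment `//somehost/` followed by `alert(1)`, and `has_host` is true, which is exactly the difference between the two bullets above.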

Another example: mailto:///[email protected] is interpreted as having a <some arbitrary data> of ///[email protected] in (2), and a path of /[email protected] in (3). This may not be relevant, since I doubt many mail clients will let you send email to such an address.

There are probably more interesting examples of this sort.


The purpose of this thread is to gather community thoughts on these scenarios, with an eye toward setting a precedent for future such schemes, and providing recommendations for software that processes such URLs (including both the web's specced data: and javascript:, and other schemes like mailto: or urn:).

If we decide (2) is better, we should provide better spec support for it, including helper operations and explicit recommendations to keep following this pattern. If we decide (3) is better, we should do the same, and we should either explicitly note data: and javascript:'s processing models as legacy, or try to change them (which might be possible if interop is bad).

/ccing some people who might have thoughts: @mnot @jasnell @sleevi @masinter
