[v2] Language combinations / extensions / embeddings / ... #3927

@LeaVerou

I've been thinking a lot about the best way to handle language definitions that depend on or make use of other languages. Some earlier thoughts in:

I have a strong hunch that these are all facets of the same problem and a good API design will minimize the number of separate solutions for each of them, so I'm going to close all three so we can discuss them holistically here.

Note

This is a work in progress and I will update it as I think more about this problem.

But before I go into the weeds, an illustration (real screenshot of our code, taken from VS Code):
[Image: VS Code screenshot of the nested languages listed below]
This is 4 languages nested in each other!

  1. The outer language is JS (well, TS)
  2. JSDoc in JS doc comments
  3. Markdown in JSDoc
  4. JS in Markdown code blocks

Use cases

There are currently two types of dependencies:

  1. Required (actual ESM import)
  2. Optional (use them if something else imported them, no problem otherwise)

And several types of use cases described below.

Note: Any mention of "now" refers to the simplify branch (draft PR) and not the current v2 branch.

1. Languages using another language as a base (e.g. JavaScript using C-like)

This is the most straightforward case: just simple inheritance.

  • The base language is now declared via the base key instead of an imperative extend() call (I wonder if parent or extends may be better names) and is considered a required dependency.
  • It is imported as a regular ESM import.
  • Its grammar is passed to the language's grammar() function via a base key that resolves synchronously.
  • By default any tokens specified by the child grammar overwrite tokens in the base grammar. If something different is desired, there are the following escape hatches:
    • $merge does a deep merge of certain tokens instead
    • $insertBefore inserts certain tokens before another
    • $insertAfter inserts certain tokens after another
    • $insert is a shorthand version of $insertBefore/$insertAfter that is better suited to one-off inserts as the position is specified inside its value via $before/$after
    • All of the above are combined into a single patch and are applied as late as possible (but I'm still debating whether that's a good idea).
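
To make the override and patch semantics above concrete, here is a rough sketch of how such a patch could be applied. This is not the code in the simplify branch: `extendGrammar`, `deepMerge`, `insertRelative`, and the exact value shapes of `$merge`, `$insertBefore`, and `$insertAfter` (objects keyed by token name or anchor token name) are assumptions made for illustration, and `$insert` is left out.

```javascript
// Hypothetical sketch, not Prism's implementation.
// A grammar is a plain object; its key order is the token order.

const isPlainObject = (v) =>
  typeof v === "object" && v !== null && !Array.isArray(v) && !(v instanceof RegExp);

// Recursively merge `source` into a copy of `target`.
function deepMerge(target, source) {
  const out = { ...target };
  for (const [key, value] of Object.entries(source)) {
    out[key] =
      isPlainObject(out[key]) && isPlainObject(value)
        ? deepMerge(out[key], value)
        : value;
  }
  return out;
}

// Return a copy of `grammar` with `tokens` placed before or after `anchor`.
// Assumes the inserted token names do not already exist in `grammar`.
function insertRelative(grammar, anchor, tokens, position) {
  if (!(anchor in grammar)) throw new Error(`Unknown anchor token: ${anchor}`);
  const out = {};
  for (const [name, def] of Object.entries(grammar)) {
    if (name === anchor && position === "before") Object.assign(out, tokens);
    out[name] = def;
    if (name === anchor && position === "after") Object.assign(out, tokens);
  }
  return out;
}

function extendGrammar(base, child) {
  const { $merge = {}, $insertBefore = {}, $insertAfter = {}, ...tokens } = child;
  // Default: child tokens overwrite same-named base tokens (position is kept).
  let grammar = { ...base, ...tokens };
  // The escape hatches are applied afterwards, as a single patch.
  grammar = deepMerge(grammar, $merge);
  for (const [anchor, extra] of Object.entries($insertBefore)) {
    grammar = insertRelative(grammar, anchor, extra, "before");
  }
  for (const [anchor, extra] of Object.entries($insertAfter)) {
    grammar = insertRelative(grammar, anchor, extra, "after");
  }
  return grammar;
}

// Usage with toy grammars (the patterns are simplified placeholders).
const clike = {
  comment: /\/\/.*/,
  string: { pattern: /"(?:\\.|[^"\\])*"/, greedy: true },
  keyword: /\b(?:if|else|while)\b/,
};

const javascript = extendGrammar(clike, {
  keyword: /\b(?:if|else|while|let|const)\b/,
  $merge: { string: { alias: "js-string" } },
  $insertBefore: { keyword: { regex: /\/(?:\\.|[^\/\\\n])+\/[gimsuy]*/ } },
});

console.log(Object.keys(javascript));
// [ 'comment', 'string', 'regex', 'keyword' ]
```

Note that `clike` itself is never mutated here; every step returns a new object, so the base stays usable on its own.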

Usually, the base language is useful on its own. E.g. clike was not just created to make its child languages more DRY, but to have something to fall back on when one wanted to highlight a C-like language that did not have a dedicated language definition (admittedly far more important when Prism first launched with like 5 languages compared to now).

These days, there are also cases where language definitions exist for the sole reason of making other language definitions more DRY (such as javadoclike), which is a perversion of the concept. No language should be registered and become available as a language-xxx class if it's not useful on its own; otherwise it's not actually a language, it's a shared utility.

2. Languages embedding/embedded in other languages

This can be broken down into two major categories:

  1. Where the language is known in advance (e.g. JS inside <script> elements)
  2. Where the language is not known in advance (e.g. code inside a tagged markdown code block or when highlighting http requests)

Case 1 was already handled by special-casing strings as values of $rest/$inside, but case 2 was severely problematic and required a fair bit of custom code. #3923 proposes a $language descriptor that can handle both, by also accepting a function that takes the match's named capturing groups as a parameter. I'm still unsure if this is a good solution.
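
As a rough illustration of the two cases, a descriptor that is either a fixed id or a function of the named capturing groups could be resolved like this. `resolveLanguage` and the Map-based `registry` are hypothetical names, not the API proposed in #3923, and the example uses a tilde-fenced Markdown code block for case 2.

```javascript
// Hypothetical sketch: `resolveLanguage` and `registry` are illustrative only.
// `$language` is either a language id (string) or a function that receives
// the match's named capturing groups and returns a language id.
function resolveLanguage($language, match, registry) {
  const id =
    typeof $language === "function" ? $language(match.groups ?? {}) : $language;
  return registry.has(id) ? id : null; // null: nothing loaded for that id
}

const registry = new Map([
  ["javascript", {}],
  ["css", {}],
]);

// Case 1: the embedded language is known in advance.
const scriptMatch = /<script>(?<code>[\s\S]*?)<\/script>/.exec(
  "<script>let x = 1;</script>"
);
const known = resolveLanguage("javascript", scriptMatch, registry);

// Case 2: the embedded language is only known once the code is matched.
const fence = /^~~~(?<lang>[\w-]+)\n(?<code>[\s\S]*?)\n~~~$/m;
const fenceMatch = fence.exec("~~~css\na { color: red }\n~~~");
const dynamic = resolveLanguage(({ lang }) => lang, fenceMatch, registry);

console.log(known, dynamic); // javascript css
```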

There is also the question of what types of dependencies these are: are they optional or required? It seems like it could go either way, depending on the user's goal, but I'm leaning towards required. But then, for 2, does that mean that now your required dependencies depend on the code being matched?

Perhaps these are actually the only true "optional" dependencies, and there should be a way for Prism users to autoload these as well. In that case, perhaps grammars should support async nodes for these that resolve when they are loaded. The way code is tokenized could easily support parts of it being deferred for later.
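
One way to picture deferred parts: embedded regions whose language is not loaded yet are emitted as plain placeholder tokens and filled in once the language arrives. This is only a sketch under that assumption; `tokenizeEmbedded`, the `loaded` map, and the loader function are hypothetical and do not exist in Prism.

```javascript
// Hypothetical sketch: none of these names exist in Prism.
// `loaded` maps language ids to tokenizer functions; `loadLanguage(id)`
// returns a promise for a tokenizer (e.g. backed by a dynamic import).
function tokenizeEmbedded(segments, loaded, loadLanguage) {
  const pending = [];
  const tokens = segments.map(({ lang, code }) => {
    const tokenize = loaded.get(lang);
    if (tokenize) return { type: lang, content: tokenize(code) };

    // Not loaded yet: emit a placeholder now and fill it in later.
    const token = { type: "deferred", content: code };
    pending.push(
      loadLanguage(lang).then((tokenizeLater) => {
        loaded.set(lang, tokenizeLater);
        token.type = lang;
        token.content = tokenizeLater(code);
      })
    );
    return token;
  });
  // `done` settles once every deferred token has been filled in.
  return { tokens, done: Promise.all(pending) };
}

// Usage with a fake loader standing in for real autoloading.
const loaded = new Map([["markup", (code) => [code]]]);
const fakeLoad = async (lang) => (code) => code.split(/\s+/);

const { tokens, done } = tokenizeEmbedded(
  [
    { lang: "markup", code: "<p>" },
    { lang: "json", code: '{"a": 1}' },
  ],
  loaded,
  fakeLoad
);

console.log(tokens[1].type); // deferred
done.then(() => console.log(tokens[1].type)); // json
```

The synchronous result is usable immediately, and a caller that cares about the embedded parts can await `done` and re-render.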

3. Languages that are preprocessors for other languages

Example: PHP or Liquid are HTML preprocessors.

This is what #3911 was about. It is further complicated by the fact that these preprocessors could often generate anything, but definitely do have defaults (usually markup). This is the prime reason custom tokenizers exist, which I would love to get rid of.

I no longer think most of #3911 was a good idea, but there is one part that I still think was: languages being able to declare what language they produce, and have that be overridable via two-id language classes (e.g. language-diff-css to highlight a CSS diff or language-liquid-css for a Liquid template that produces CSS).
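
A sketch of how two-id classes might be resolved, assuming a hypothetical `produces` key on language definitions and a Map of known languages. Since language ids can themselves contain hyphens (e.g. css-extras), this sketch checks candidate splits against the registered ids rather than splitting on the first hyphen.

```javascript
// Hypothetical sketch: `produces` and `parseLanguageClass` are made-up names.
function parseLanguageClass(className, languages) {
  const match = /^language-(.+)$/.exec(className);
  if (!match) return null;
  const id = match[1];

  // A single known id: fall back to the language's declared default output.
  if (languages.has(id)) {
    return { host: id, produces: languages.get(id).produces ?? null };
  }

  // Otherwise try every hyphen as a split point between the two ids.
  for (let i = id.indexOf("-"); i !== -1; i = id.indexOf("-", i + 1)) {
    const host = id.slice(0, i);
    const produces = id.slice(i + 1);
    if (languages.has(host) && languages.has(produces)) return { host, produces };
  }
  return null;
}

const languages = new Map([
  ["markup", {}],
  ["css", {}],
  ["css-extras", {}],
  ["diff", {}],
  ["liquid", { produces: "markup" }],
]);

console.log(parseLanguageClass("language-liquid", languages));
// { host: 'liquid', produces: 'markup' }
console.log(parseLanguageClass("language-liquid-css", languages));
// { host: 'liquid', produces: 'css' }
console.log(parseLanguageClass("language-css-extras", languages));
// { host: 'css-extras', produces: null }
```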

I'm still unsure how exactly these work today, but perhaps a good solution to 2 could also address these (by essentially emulating $rest: childLanguage).

4. Languages that can make other language definitions "richer" but are not strictly necessary

This one is the hairiest category as it encompasses so many diverse use cases.

Examples:

  • javastacktrace extending log
  • Tags inside VB.NET/F# doc comments being highlighted as tags if markup is loaded. This one is basically highlighting the need for a shared utility for tag.
  • JSON in http being highlighted as JSON if that is loaded, or as JS otherwise. That seems to be a bona fide optional dependency.
  • markdown in graphql comments being highlighted if it is loaded. That seems to be a bona fide optional dependency.
  • jsdoc in JS doc-comment tokens is highlighted if jsdoc is loaded. That seems to be a bona fide optional dependency.
  • js-templates extending JS with the ability to highlight template literals tagged with a certain language. Not everyone highlighting JS wants to highlight tagged template literals, but since JS is the host language, it cannot be language-js-templates that activates this functionality.
  • opencl-extensions extending C and C++. Not everyone highlighting C/C++ wants this.
  • css-extras extending css with specialized tokens for selectors etc. Not everyone highlighting CSS wants the granularity of css-extras.

Languages should not modify other languages

Previously, there were more of these, which existed for the sole reason of reducing bundle sizes to the extreme (like saving 1KB). These are now eliminated. The ones that remain are those that fundamentally should involve user choice, as described above.

The toughest of all are those like the last three, which are also currently the only uses for extends (#3911). Languages extending other languages are deeply problematic:

Optional dependencies beyond actual embedding are a smell

Even in use cases that are "proper" optional dependencies, it feels that this logic should really live with the child language. But …if it does, that would mean the child language modifies the parent language, which, as described above, is evil!

Not necessarily. Languages modifying other languages was one way to do it. What if there are others?

Essentially, in all of these, we have one language adding granularity to another. In most of them, we don't want users to have to opt in separately for every use, so languages modifying other languages was invented as a solution to that. E.g. you may want all your CSS examples to be highlighted with the granularity of css-extras, and it would be annoying (and incompatible) to have to specify language-css-extras each time. In many cases, an opt-in doesn't make much sense at all, and it's really about not bloating the bundle size. E.g. of course you want to highlight JS in HTML if you have a JS language definition loaded.

Ideas

These are currently mainly for 4. I have some ideas for the rest, but it's 4 that is the hairiest.

1. Language aliases

We could extend the concept of language aliases to existing languages. Then e.g. css-extras could be defined as just regular inheritance over css and one could alias css to css-extras.

  • Pros: Predictable, not affected by ordering effects, re-uses an existing mechanism (inheritance)
  • Cons: Lacks composability. How do I use two different types of "extras"?
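
A minimal sketch of the alias idea, assuming a hypothetical `LanguageRegistry` in which an alias may shadow an already-registered id. Here css-extras is plain inheritance over css (modeled with object spread), and css is then aliased to it.

```javascript
// Hypothetical sketch: `LanguageRegistry` is not Prism's registry.
class LanguageRegistry {
  constructor() {
    this.grammars = new Map();
    this.aliases = new Map();
  }

  register(id, grammar) {
    this.grammars.set(id, grammar);
    return grammar;
  }

  // Point `id` at `targetId`, even if `id` is itself a registered language.
  alias(id, targetId) {
    if (!this.grammars.has(targetId)) {
      throw new Error(`Cannot alias to unknown language: ${targetId}`);
    }
    this.aliases.set(id, targetId);
  }

  // Aliases win over registered ids; the original grammar is left untouched.
  resolve(id) {
    return this.grammars.get(this.aliases.get(id) ?? id) ?? null;
  }
}

const registry = new LanguageRegistry();
const css = registry.register("css", { selector: /[^{}]+(?=\{)/ });
const cssExtras = registry.register("css-extras", {
  ...css, // regular inheritance over css
  "pseudo-class": /:[-\w]+/,
});

registry.alias("css", "css-extras");
console.log(registry.resolve("css") === cssExtras); // true
console.log("pseudo-class" in css); // false
```

The composability problem shows up directly: a second `registry.alias("css", ...)` call would simply replace the first, so only one kind of "extras" can be active per id.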

2. Language extensions layered on top of existing languages without modifying them

Instead of language extensions actually mutating the host language, what if languages could declare that they are automatically applied within certain tokens of other languages?

  • Pros: Composability
  • Cons: Unclear if this would actually not cause the same issues as languages being mutated, since in theory there could still be ordering effects here
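
To sketch what "layered without mutating" might look like: an extension declares which host language and which tokens it applies within, and a composed grammar is built at highlight time while the host stays as registered. The `host` and `inside` keys on the layer object and the `composeGrammar` helper are assumptions for illustration only.

```javascript
// Hypothetical sketch: the layer shape and `composeGrammar` are made up.
const normalize = (def) => (def instanceof RegExp ? { pattern: def } : def);

function composeGrammar(hostId, grammars, layers) {
  const composed = { ...grammars.get(hostId) }; // shallow copy, host untouched
  for (const layer of layers) {
    if (layer.host !== hostId) continue;
    for (const [tokenName, extraTokens] of Object.entries(layer.inside)) {
      if (!(tokenName in composed)) continue;
      const def = normalize(composed[tokenName]);
      // Build a new token definition instead of editing the host's one.
      composed[tokenName] = { ...def, inside: { ...def.inside, ...extraTokens } };
    }
  }
  return composed;
}

const grammars = new Map([
  ["css", { selector: /[^{}\s][^{}]*(?=\s*\{)/, property: /[-\w]+(?=\s*:)/ }],
]);

const cssExtras = {
  host: "css",
  inside: { selector: { "pseudo-class": /:[-\w]+/, class: /\.[-\w]+/ } },
};

const rich = composeGrammar("css", grammars, [cssExtras]);
console.log(Object.keys(rich.selector.inside)); // [ 'pseudo-class', 'class' ]
console.log(grammars.get("css").selector instanceof RegExp); // true
```

The ordering concern is still visible in this sketch: the order of the `layers` array decides both the token order inside `inside` and which layer wins when two of them define the same token name.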

3. Language modifications with defined ordering

Languages like css-extras are never autoloaded, right? They need to be explicitly loaded …somehow. So perhaps the ordering effects go away on their own in v2, simply because loading order is much better defined.

Additionally, we could soften the blow by making it configurable with a Prism config option for how to handle extends languages:

  1. Default: Modify the parent language
  2. Create a new language definition. E.g. you'd need to use language-css-extras explicitly to use css-extras

In fact, we could create the new language definition anyway.
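
A sketch of how the two modes could differ, assuming a hypothetical `extendsMode` config option and `registerExtension` helper (neither exists in Prism). In both modes the extension is registered under its own id; only the default mode also replaces the parent.

```javascript
// Hypothetical sketch: `extendsMode` and `registerExtension` are made-up names.
function registerExtension(languages, extension, config = {}) {
  const { extendsMode = "modify" } = config;
  const parent = languages.get(extension.extends);
  if (!parent) throw new Error(`Unknown parent language: ${extension.extends}`);

  const combined = { ...parent, ...extension.grammar };
  // The new language definition is created in either mode.
  languages.set(extension.id, combined);
  // Default mode: the parent id now resolves to the richer grammar too.
  if (extendsMode === "modify") languages.set(extension.extends, combined);
  return combined;
}

const cssExtras = {
  id: "css-extras",
  extends: "css",
  grammar: { "pseudo-class": /:[-\w]+/ },
};

const modified = new Map([["css", { selector: /[^{}]+(?=\{)/ }]]);
registerExtension(modified, cssExtras);
console.log("pseudo-class" in modified.get("css")); // true

const separate = new Map([["css", { selector: /[^{}]+(?=\{)/ }]]);
registerExtension(separate, cssExtras, { extendsMode: "new-language" });
console.log("pseudo-class" in separate.get("css")); // false
console.log("pseudo-class" in separate.get("css-extras")); // true
```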
