[v2] Language combinations / extensions / embeddings / ...

I've been thinking a lot about the best way to handle language definitions that depend on or make use of other languages. Some earlier thoughts in:
- https://github.com/PrismJS/prism/issues/3911
- https://github.com/PrismJS/prism/issues/3923
- https://github.com/PrismJS/prism/issues/3916

I have a strong hunch that these are all facets of the same problem and a good API design will minimize the number of separate solutions for each of them, so I'm going to close all three so we can discuss them holistically here.

> [!NOTE]
> This is a work in progress and I will update it as I think more about this problem.

But before I go into the weeds, an illustration (real screenshot of our code, taken from VS Code):
<img width="837" alt="Image" src="https://github.com/user-attachments/assets/3a4e8e0f-2d06-4215-9b5e-b9b01ecdc9f6" />
This is 4 languages nested in each other!
1. The outer language is JS (well, TS)
2. JSDoc in JS doc comments
3. Markdown in JSDoc
4. JS in Markdown code blocks

## Use cases 

There are currently two types of dependencies:
1. Required (actual ESM import)
2. Optional (use them if something else imported them, no problem otherwise)

And several types of use cases described below.

_**Note:** Any mention of "now" refers to the `simplify` branch (draft PR) and not the current `v2` branch._

### 1. Languages using another language as a base (e.g. JavaScript using C-like)

This is the most straightforward case: just simple inheritance.

- The base language is now declared via the `base` key instead of an imperative `extend()` call (I wonder if `parent` or `extends` may be better names) and is considered a required dependency.
- It is imported as a regular ESM import
- Its grammar is passed to the language's `grammar()` function via a `base` key that resolves synchronously. 
- By default any tokens specified by the child grammar overwrite tokens in the base grammar. If something different is desired, there are the following escape hatches:
  - `$merge` does a deep merge of certain tokens instead
  - `$insertBefore` inserts certain tokens before another
  - `$insertAfter` inserts certain tokens after another
  - `$insert` is a shorthand version of `$insertBefore`/`$insertAfter` that is better suited to one-off inserts as the position is specified inside its value via `$before`/`$after`
  - All of the above are combined into a single patch and are applied as late as possible (but I'm still debating whether that's a good idea).
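
To make the patching rules above concrete, here is a minimal sketch of how a child grammar could be resolved against its base. It is not Prism's implementation: `resolveGrammar`, `deepMerge`, `insertTokens`, the plain-object grammar shape, and the value shapes of `$merge`/`$insertBefore`/`$insertAfter` are all assumptions made for illustration, and `$insert` is left out for brevity.

```javascript
// Hypothetical sketch, not Prism's implementation. Grammars are plain
// objects whose key order is the order in which tokens are tried.

const isPlainObject = (value) =>
  typeof value === 'object' && value !== null &&
  !Array.isArray(value) && !(value instanceof RegExp);

// Recursively merge `patch` into a copy of `target`.
function deepMerge(target, patch) {
  const out = { ...target };
  for (const [key, value] of Object.entries(patch)) {
    out[key] = isPlainObject(out[key]) && isPlainObject(value)
      ? deepMerge(out[key], value)
      : value;
  }
  return out;
}

// Return a copy of `grammar` with `tokens` placed before or after `anchor`.
function insertTokens(grammar, anchor, tokens, position) {
  const result = {};
  for (const [name, value] of Object.entries(grammar)) {
    if (name !== anchor && name in tokens) continue; // replaced by the insert
    if (name === anchor && position === 'before') Object.assign(result, tokens);
    result[name] = value;
    if (name === anchor && position === 'after') Object.assign(result, tokens);
  }
  return result;
}

function resolveGrammar(definition, base = {}) {
  const { $merge = {}, $insertBefore = {}, $insertAfter = {}, ...tokens } = definition;
  // Default behavior: child tokens overwrite same-named base tokens.
  let grammar = { ...base, ...tokens };
  for (const [name, patch] of Object.entries($merge)) {
    grammar[name] = deepMerge(grammar[name] ?? {}, patch);
  }
  for (const [anchor, inserted] of Object.entries($insertBefore)) {
    grammar = insertTokens(grammar, anchor, inserted, 'before');
  }
  for (const [anchor, inserted] of Object.entries($insertAfter)) {
    grammar = insertTokens(grammar, anchor, inserted, 'after');
  }
  return grammar;
}

// Toy grammars, only loosely modeled on `clike` and JS.
const clike = {
  comment: { pattern: /\/\/.*/ },
  string: { pattern: /"[^"]*"/ },
  keyword: { pattern: /\b(?:if|else|while)\b/ },
};

const javascript = resolveGrammar({
  keyword: { pattern: /\b(?:if|else|while|const|let)\b/ }, // overwrites
  $merge: { string: { greedy: true } },                     // deep merge
  $insertBefore: { keyword: { regex: { pattern: /\/[^\/\n]+\/[a-z]*/ } } },
}, clike);

console.log(Object.keys(javascript)); // [ 'comment', 'string', 'regex', 'keyword' ]
```

In this sketch the base grammar is never mutated: every step returns a new object, which is one way to keep the base cacheable regardless of how many children patch it.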

Usually, the base language is useful on its own. E.g. `clike` was not just created to make its child languages more DRY, but to have something to fall back on when one wanted to highlight a C-like language that did not have a dedicated language definition (admittedly far more important when Prism first launched with like 5 languages compared to now).

These days, there are also cases where language definitions exist for the sole reason of making other language definitions more DRY (such as `javadoclike`), which is a perversion of the concept. No language should be registered and become available as a `language-xxx` class if it's not useful on its own; otherwise it's not actually a language, it's a shared utility.

### 2. Languages embedding/embedded in other languages

This can be broken down into two major categories:
  1. Where the language is known in advance (e.g. JS inside `<script>` elements)
  2. Where the language is not known in advance (e.g. code inside a tagged `markdown` code block or when highlighting `http` requests)

Category 1 was already handled by special-casing strings as values of `$rest`/`$inside`, but category 2 was severely problematic and required a fair bit of custom code. #3923 proposes a `$language` descriptor that can handle both, because it also accepts a function that takes the named capturing groups as a parameter. I'm still unsure if this is a good solution.
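
As a rough illustration of how one descriptor could cover both categories, the sketch below resolves a language id either from a fixed string or from a function of the match's named groups. `resolveEmbeddedLanguage`, the token shapes, and the toy patterns are assumptions for this example and are not taken from #3923.

```javascript
// Hypothetical sketch of the `$language` idea; names and shapes are assumed.
function resolveEmbeddedLanguage($language, match) {
  return typeof $language === 'function'
    ? $language(match.groups ?? {})
    : $language;
}

// Category 1: the embedded language is known in advance.
const scriptToken = {
  pattern: /<script>(?<code>[\s\S]*?)<\/script>/,
  $language: 'javascript',
};

// Category 2: the language depends on the code being matched.
const fencedCodeToken = {
  pattern: /`{3}(?<lang>[\w-]+)\n(?<code>[\s\S]*?)\n`{3}/,
  $language: (groups) => groups.lang,
};

const fence = '`'.repeat(3);
const markdownSample = fence + 'css\na { color: red }\n' + fence;

const scriptMatch = scriptToken.pattern.exec('<script>let x = 1;</script>');
const fenceMatch = fencedCodeToken.pattern.exec(markdownSample);

console.log(resolveEmbeddedLanguage(scriptToken.$language, scriptMatch));    // javascript
console.log(resolveEmbeddedLanguage(fencedCodeToken.$language, fenceMatch)); // css
```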

There is also the question of what types of dependencies these are: are they optional or required? It seems like it could go either way, depending on the user's goal, but I'm leaning towards required. But then, for category 2, does that mean that your required dependencies now depend on the code being matched?

Perhaps these are actually the only _true_ "optional" dependencies, and there should be a way for Prism users to autoload these as well. In that case, perhaps grammars should support async nodes for these that resolve when they are loaded. The way code is tokenized could easily support parts of it being deferred for later.
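
One way to picture deferred nodes is sketched below. It uses callbacks instead of promises to keep the example synchronous, and everything in it (`LanguageRegistry`, `whenLoaded`, `embed`, the node shape) is hypothetical rather than an existing Prism API.

```javascript
// Hypothetical sketch; `LanguageRegistry` and the node shape are assumptions.
class LanguageRegistry {
  constructor() {
    this.languages = new Map(); // language id -> tokenize function
    this.waiting = new Map();   // language id -> callbacks to run once it loads
  }

  // Run `callback` now if the language is loaded, otherwise when it loads.
  whenLoaded(id, callback) {
    if (this.languages.has(id)) {
      callback(this.languages.get(id));
      return;
    }
    if (!this.waiting.has(id)) this.waiting.set(id, []);
    this.waiting.get(id).push(callback);
  }

  register(id, tokenize) {
    this.languages.set(id, tokenize);
    for (const callback of this.waiting.get(id) ?? []) callback(tokenize);
    this.waiting.delete(id);
  }
}

// Create a node for embedded code. It holds plain text until the embedded
// language becomes available, then it is filled in with tokens.
function embed(registry, id, code) {
  const node = { type: 'embedded', language: id, content: code, pending: true };
  registry.whenLoaded(id, (tokenize) => {
    node.content = tokenize(code);
    node.pending = false;
  });
  return node;
}

const registry = new LanguageRegistry();
const node = embed(registry, 'css', 'a { color: red }');
console.log(node.pending); // true

// Later, e.g. once an autoloader has fetched the language:
registry.register('css', (code) => code.split(/\s+/)); // toy tokenizer
console.log(node.pending); // false
```

The rest of the token stream does not have to wait: only the embedded node is filled in late, which is the property the paragraph above relies on.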

### 3. Languages that are preprocessors for other languages

Example: PHP or Liquid are HTML preprocessors.

This is what #3911 was about. It is further complicated by the fact that these preprocessors could often generate anything, but definitely do have defaults (usually `markup`). This is the prime reason custom tokenizers exist, which I would love to get rid of.

I no longer think most of #3911 was a good idea, but there is one part that I still think was: languages being able to declare what language they _produce_, and have that be overridable via two-id language classes (e.g. `language-diff-css` to highlight a CSS diff or `language-liquid-css` for a Liquid template that produces CSS).
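
A sketch of how such two-id classes could be resolved follows. The `produces` key, `resolveLanguageClass`, and the sample registry are hypothetical. Because language ids can themselves contain hyphens (e.g. `css-extras`), the sketch checks candidate splits against the registered ids instead of splitting on the first hyphen.

```javascript
// Hypothetical sketch; `produces` and `resolveLanguageClass` are assumed names.
const languages = new Map([
  ['markup', {}],
  ['css', {}],
  ['css-extras', {}],
  ['diff', {}],
  ['liquid', { produces: 'markup' }], // declared default output language
]);

function resolveLanguageClass(className, languages) {
  const id = className.replace(/^language-/, '');

  // A registered id always wins, so `css-extras` is never read as css + extras.
  if (languages.has(id)) {
    return { host: id, produces: languages.get(id).produces ?? null };
  }

  // Otherwise try every hyphen as the split between host and produced language.
  for (let i = id.indexOf('-'); i !== -1; i = id.indexOf('-', i + 1)) {
    const host = id.slice(0, i);
    const produces = id.slice(i + 1);
    if (languages.has(host) && languages.has(produces)) {
      return { host, produces };
    }
  }
  return null; // unknown language
}

console.log(resolveLanguageClass('language-liquid', languages));
// { host: 'liquid', produces: 'markup' }
console.log(resolveLanguageClass('language-liquid-css', languages));
// { host: 'liquid', produces: 'css' }
console.log(resolveLanguageClass('language-diff-css-extras', languages));
// { host: 'diff', produces: 'css-extras' }
```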

I'm still unsure how exactly these work today, but perhaps a good solution to 2 could also address these (by essentially emulating `$rest: childLanguage`).

### 4. Languages that can make other language definitions "richer" but are not strictly necessary 

This one is the hairiest category as it encompasses so many diverse use cases.

Examples:
- `javastacktrace` extending `log`
- Tags inside VB.NET/F# doc comments being highlighted as tags if `markup` is loaded. This one is basically highlighting the need for a shared utility for `tag`.
- JSON in `http` being highlighted as JSON if that is loaded, or as JS otherwise. That seems to be a bona fide optional dependency.
- `markdown` in `graphql` comments being highlighted if it is loaded. That seems to be a bona fide optional dependency.
- `jsdoc` in JS `doc-comment` tokens is highlighted if `jsdoc` is loaded. That seems to be a bona fide optional dependency.
- `js-templates` extending JS with the ability to highlight template literals tagged with a certain language. Not everyone highlighting JS wants to highlight tagged template literals, but since JS is the host language, it cannot be `language-js-templates` that activates this functionality.
- `opencl-extensions` extending C and C++. Not everyone highlighting C/C++ wants this.
- `css-extras` extending `css` with specialized tokens for selectors etc. Not everyone highlighting CSS wants the granularity of `css-extras`.
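
The optional-dependency examples in this list share one shape: use the best language that happens to be loaded, and fall back otherwise. A minimal sketch of that lookup, where `firstLoaded` and the set of loaded ids are hypothetical:

```javascript
// Hypothetical sketch: pick the first candidate language that is loaded.
function firstLoaded(loaded, candidates) {
  return candidates.find((id) => loaded.has(id)) ?? null;
}

const loaded = new Set(['http', 'javascript']);

// JSON bodies in `http`: prefer `json`, fall back to `javascript`.
console.log(firstLoaded(loaded, ['json', 'javascript'])); // javascript

loaded.add('json');
console.log(firstLoaded(loaded, ['json', 'javascript'])); // json

// `jsdoc` in doc comments: no fallback, so the comment stays a plain comment.
console.log(firstLoaded(loaded, ['jsdoc'])); // null
```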

#### Languages should not modify other languages

Previously, there were more of these, which existed for the sole reason of reducing bundle sizes to the extreme (like saving 1KB). These are now eliminated. The ones that remain are those that fundamentally should involve user choice, as described above.

The toughest of all are those like the last three, which are also currently the only uses for `extends` (#3911). Languages extending other languages are deeply problematic:
- It means their resolved grammars cannot be cached — every Prism instance needs to spawn its own (though that might be unavoidable since plugins could also modify them)
- @RunDevelopment warned very strongly against them in https://github.com/PrismJS/prism/issues/3911#issuecomment-2833647757 mentioning ordering effects that created chaos.

#### Optional dependencies beyond actual embedding are a smell

Even in use cases that are "proper" optional dependencies, it feels that this logic should really live with the child language. But …if it does, that would mean the child language modifies the parent language, which, as described above, is evil! 

_Not necessarily._ Languages modifying other languages was one way to do it. What if there are others?

Essentially, in all of these, we have one language adding granularity to another. In most of them, we don't want users to have to opt in separately for every use, so languages modifying other languages was invented as a solution to that. E.g. you may want all your CSS examples to be highlighted with the granularity of `css-extras`, and it would be annoying (and incompatible) to have to specify `language-css-extras` each time. In many cases, an opt-in doesn't make much sense at all, and it's really about not bloating the bundle size. E.g. of course you want to highlight JS in HTML if you have a JS language definition loaded.

## Ideas

These are currently mainly for use case 4. I have some ideas for the rest, but 4 is the hairiest.

### 1. Language aliases

We could extend the concept of language aliases to existing languages. Then e.g. `css-extras` could be defined as just regular inheritance over `css` and one could alias `css` to `css-extras`.

- Pros: Predictable, not affected by ordering effects, re-uses an existing mechanism (inheritance)
- Cons: Lacks composability. How do I use two different types of "extras"?
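
A small sketch of the alias idea, including the composability problem from the cons. The `AliasRegistry` class, its methods, and the `css-other` language are hypothetical.

```javascript
// Hypothetical sketch; `AliasRegistry` is not an existing Prism API.
class AliasRegistry {
  constructor() {
    this.grammars = new Map();
    this.aliases = new Map();
  }

  register(id, grammar) {
    this.grammars.set(id, grammar);
  }

  // Make lookups of `from` resolve to `to`, even if `from` is registered.
  alias(from, to) {
    this.aliases.set(from, to);
  }

  resolve(id) {
    const seen = new Set(); // guards against alias cycles
    while (this.aliases.has(id) && !seen.has(id)) {
      seen.add(id);
      id = this.aliases.get(id);
    }
    return this.grammars.get(id);
  }
}

const css = { selector: /[^{}]+(?=\{)/ };
const cssExtras = { ...css, 'pseudo-class': /:[\w-]+/ }; // plain inheritance
const cssOther = { ...css, unit: /\b(?:px|em|rem)\b/ };  // another "extras"

const aliasRegistry = new AliasRegistry();
aliasRegistry.register('css', css);
aliasRegistry.register('css-extras', cssExtras);
aliasRegistry.register('css-other', cssOther);

aliasRegistry.alias('css', 'css-extras');
console.log(aliasRegistry.resolve('css') === cssExtras); // true

// The composability problem: a second alias replaces the first one.
aliasRegistry.alias('css', 'css-other');
console.log(aliasRegistry.resolve('css') === cssOther); // true
console.log('pseudo-class' in aliasRegistry.resolve('css')); // false
```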

### 2. Language extensions layered on top of existing languages without modifying them

Instead of language extensions actually mutating the host language, what if languages could declare that they are automatically applied within certain tokens of other languages?

- Pros: Composability
- Cons: Unclear if this would actually not cause the same issues as languages being mutated, since in theory there could still be ordering effects here
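
A sketch of what "layered without modifying" could look like: extensions declare which host token they apply inside, and the effective nested grammar is computed at lookup time. `declareLayer`, `insideFor`, and the token shapes are hypothetical. The sketch also shows where ordering could still leak in, since layers are combined in declaration order.

```javascript
// Hypothetical sketch; none of these names are existing Prism APIs.
const layers = new Map(); // "host/token" -> array of extension grammars

// An extension declares that it applies inside `token` of `host`.
function declareLayer(host, token, grammar) {
  const key = `${host}/${token}`;
  layers.set(key, [...(layers.get(key) ?? []), grammar]);
}

// Compute the nested grammar for a host token without touching the host.
// Layer tokens come first in key order; on a name clash the host's own
// nested definition wins because it is spread last.
function insideFor(host, token, hostGrammar) {
  const own = hostGrammar[token]?.inside ?? {};
  const declared = layers.get(`${host}/${token}`) ?? [];
  return Object.assign({}, ...declared, own);
}

const cssGrammar = Object.freeze({
  selector: Object.freeze({ pattern: /[^{}\s][^{}]*(?=\s*\{)/ }),
});

declareLayer('css', 'selector', { 'pseudo-class': { pattern: /:[\w-]+/ } });
declareLayer('css', 'selector', { id: { pattern: /#[\w-]+/ } });

console.log(Object.keys(insideFor('css', 'selector', cssGrammar)));
// [ 'pseudo-class', 'id' ]
console.log(cssGrammar.selector.inside); // undefined, the host is untouched
```

Two layers compose here, which the alias approach cannot do, but swapping the two `declareLayer` calls changes the token order, so the ordering concern from the cons still applies.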

### 3. Language modifications with defined ordering

Languages like `css-extras` are never autoloaded, right? They need to be explicitly loaded …somehow. So perhaps the ordering effects go away on their own in v2, simply because loading order is much more well defined.

Additionally, we could soften the blow by making it configurable with a Prism config option for how to handle `extends` languages: 
1. Default: Modify the parent language
2. Create a new language definition. E.g. you'd need to use `language-css-extras` explicitly to use `css-extras`

In fact, we could create the new language definition *anyway*.
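
A sketch of how such a config option could behave, with the extension always registered under its own id as well. The `extendsMode` option, its `'modify'`/`'separate'` values, and `registerExtension` are names invented for this example.

```javascript
// Hypothetical sketch; `extendsMode` and its values are assumed names.
function registerExtension(languages, extension, options = {}) {
  const { extendsMode = 'modify' } = options;
  const parent = languages.get(extension.extends);
  if (!parent) {
    throw new Error('Unknown parent language: ' + extension.extends);
  }

  const combined = { ...parent, ...extension.tokens };

  // The extension is always available under its own id ...
  languages.set(extension.id, combined);
  // ... and only replaces the parent when the config asks for it.
  if (extendsMode === 'modify') {
    languages.set(extension.extends, combined);
  }
}

const cssExtrasDefinition = {
  id: 'css-extras',
  extends: 'css',
  tokens: { 'pseudo-class': /:[\w-]+/ },
};

// Option 1 (default): `language-css` picks up the extras.
const modified = new Map([['css', { selector: /[^{}]+(?=\{)/ }]]);
registerExtension(modified, cssExtrasDefinition);
console.log('pseudo-class' in modified.get('css')); // true

// Option 2: only `language-css-extras` gets the extras.
const separate = new Map([['css', { selector: /[^{}]+(?=\{)/ }]]);
registerExtension(separate, cssExtrasDefinition, { extendsMode: 'separate' });
console.log('pseudo-class' in separate.get('css'));        // false
console.log('pseudo-class' in separate.get('css-extras')); // true
```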
