Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work

**Describe the bug**

remarkShortcodes.js uses the blockTokenizers extension mechanism of remark-parse. The documentation for the last version of remark-parse which supported this feature can be found [here](https://github.com/remarkjs/remark/tree/remark-stringify%408.1.1/packages/remark-parse#parserblocktokenizers). Quoting from there (emphasis mine):

> Tokenizers test whether a document _starts_ with a certain syntactic entity.

I understand this to mean that any registered block tokenizer (like the one registered by remarkShortcodes.js) is tried at the beginning of each "block" in a markdown document. (Probably roughly corresponding to blocks of lines separated by two consecutive newlines, but I couldn't find clear documentation on that.)

The current version of [remarkShortcodes.js](https://github.com/decaporg/decap-cms/blob/main/packages/decap-cms-widget-markdown/src/serializers/remarkShortcodes.js) is a bit complex, but I think essentially it looks for matches to the shortcode patterns at all paragraphs in the remaining document passed in by remark and then chooses the earliest of those matches.
This means that the bevaviour is both incorrect, and runtime is quadratic in the length of the source document (or at least the number of paragraphs). A previous version of remarkShortcodes.js had code that was more correct.

This currently shows up as the "Sent invalid data to remark"-warning message, which has been reported a couple of times. At least [1](https://github.com/decaporg/decap-cms/issues/4895), [2](https://github.com/decaporg/decap-cms/issues/4976), [3](https://github.com/decaporg/decap-cms/issues/2421) are related, probably more.

This warning message is emitted by a try-catch block in remarkShortcodes.js that actually papers over the logic bug in remarkShortcodes.js. Specifically, this warning message will be emitted whenever there is a shortcode that matches somewhere in the document, but not at the current parsing position. The `eat(match[0])` then causes remark to throw an error.


**To Reproduce**

Look at the source code for remarkShortcodes.js and the documentation for remark block tokenizers.

**Expected behavior**

I believe a more correct behaviour would be achieved in the following way:

* Revert remarkShortcodes.js to [this version](https://github.com/decaporg/decap-cms/blob/b0fcaec4c30e3e7028206de1d83ffdd4a6fdbba8/packages/netlify-cms-widget-markdown/src/serializers/remarkShortcodes.js).
* Clearly document and/or enforce that patterns for all components registered with registerEditorComponent have to start with `^`, to only match at the beginning of the block currently being parsed by remark.
* Clearly document and/or enforce that these patterns should **not** use RegExp multiline mode. Document that they will be run only at block boundaries. (Why should they never use multiline mode?: This would defeat the `^` at the beginning of the pattern, as the pattern will now again match at every line of the document.)
* Perhaps check in remarkShortcodes.js that `match.index == 0`, and in case this is violated, error out early on with a descriptive error message that says that a registered editor component is using an incorrect pattern. (Any such use of a pattern not starting with `^` will mess up the whole shortcode parsing code.)

**Applicable Versions:**

main branch


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work #7315

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Current code in remarkShortcodes.js misunderstands how remark's block tokenizers are supposed to work #7315

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions