Skip to content

[Markdown] Fix HTML comment parser. #2121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 3, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions pkgs/markdown/CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
* Update the README link to the markdown playground
(https://dart-lang.github.io/tools).
* Update `package:web` API references in the example.
* Fix performance and correctness of HTML comment parser.
* Require Dart `^3.4.0`.

## 7.3.0
Expand Down
13 changes: 8 additions & 5 deletions pkgs/markdown/lib/src/inline_syntaxes/inline_html_syntax.dart
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,9 @@
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.

// These are RegExps, always use raw strings.
// ignore_for_file: unnecessary_raw_strings

import '../../markdown.dart';
import '../charcode.dart';
import '../patterns.dart';
Expand All @@ -22,23 +25,23 @@ class InlineHtmlSyntax extends TextSyntax {

// HTML comment, see
// https://spec.commonmark.org/0.30/#html-comment.
'<!--(?:(?:[^-<>]+-[^-<>]+)+|[^-<>]+)-->'
r'<!--(?!-?>)[^\-]*-(?:[^\-]+-)*?->'
'|'

// Processing-instruction, see
// https://spec.commonmark.org/0.30/#processing-instruction
r'<\?.*?\?>'
r'<\?[^]*?\?>'
'|'

// Declaration, see
// https://spec.commonmark.org/0.30/#declaration.
'(<![a-z]+.*?>)'
r'(<![a-zA-Z]+[^]*?>)'
'|'

// CDATA section, see
// https://spec.commonmark.org/0.30/#cdata-section.
r'(<!\[CDATA\[.*?]]>)';
r'(<!\[CDATA\[[^]*?\]\]>)';

InlineHtmlSyntax()
: super(_pattern, startCharacter: $lt, caseSensitive: false);
: super(_pattern, startCharacter: $lt, caseSensitive: true);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The <![CDATA[ tag is case sensitive, so the RegExp should be too.
Added A-Z to all the a-zs used here and in namedTagDefinition.

}
6 changes: 3 additions & 3 deletions pkgs/markdown/lib/src/patterns.dart
Original file line number Diff line number Diff line change
Expand Up @@ -65,15 +65,15 @@ const namedTagDefinition =
'<'

// Tag name.
'[a-z][a-z0-9-]*'
'[a-zA-Z][a-zA-Z0-9-]*'

// Attribute begins, see
// https://spec.commonmark.org/0.30/#attribute.
r'(?:\s+'

// Attribute name, see
// https://spec.commonmark.org/0.30/#attribute-name.
'[a-z_:][a-z0-9._:-]*'
'[a-zA-Z_:][a-zA-Z0-9._:-]*'

//
'(?:'
Expand All @@ -96,7 +96,7 @@ const namedTagDefinition =

// Closing tag, see
// https://spec.commonmark.org/0.30/#closing-tag.
r'</[a-z][a-z0-9-]*\s*>';
r'</[a-zA-Z][a-zA-Z0-9-]*\s*>';

/// A pattern to match the start of an HTML block.
///
Expand Down
56 changes: 56 additions & 0 deletions pkgs/markdown/test/regression_test.dart
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
// Copyright (c) 2025, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.

import 'package:markdown/markdown.dart';
import 'package:test/test.dart';

void main() {
test('HTML comment with dashes #2119', () {
// See https://dartbug.com/tools/2119.
// For this issue, the leading letter was needed,
// an HTML comment starting a line is handled by a different path.
// The empty line before the `-->` is needed.
// The number of lines increase time exponentially.
// The length of lines affect the base of the exponentiation.
// Locally, three "Lorem-ipsum" lines ran in ~6 seconds, two in < 200 ms.
// Adding a fourth line should ensure it cannot possibly finish in ten
// seconds if the bug isn't fixed.
const input = '''
a <!--
- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo consequat.
- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo consequat.
- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo consequat.
- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo consequat.
-->
''';

final time = Stopwatch()..start();
final html = markdownToHtml(input); // Should not hang.
expect(html, isNotNull); // To use the output.
final elapsed = time.elapsedMilliseconds;
expect(elapsed, lessThan(10000));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected runtime now? Is there a big enough margin for low powered machines running this test?

Idea: Rather than using a fixed timeout, we measure 1, 2, 3, and 4 paragraphs and verify that the runtime grows roughly linearly instead of exponentially?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be linear in the size of the input like all the other RegExps. Which means it'll drown in the noise of everything else that is being done.

In inserted a print(elapsed); in the test, with four - lines, and ran it five times. It took times in the range 115-130 ms.
Then I added 16 more entries, 5-doubling the size and ran the test again. That took 133-162 ms.
For the heck of it, I 5-doubled it again, adding 80 more entries, and it now took 142-186 ms.

The time taken by running that RegExp is trivial. I'd guess compiling it at runtime takes longer than running it on any reasonable example.

But why guess when I can check it!
I put a for (var i = 0; i < 2; i++){...} around the code of the test, to run it again after the RegExp has already been compiled.
The first run of each test takes the same kind of time 137-183 ms.
The second run takes 4-14 ms. For the 100-entry sample text.

Running the RegExp is not taking any significant time. Checking that it is linear is going to take a lot of size, 1000+ lines, before it's even measurably above noise-level (4-14 ms is a 3.5x variance).

I'm not worried. This test just checks that we don't revert the RegExp accidentally.

});

test('HTML comment with lt/gt', () {
// Incorrect parsing found as part of fixing #2119.
// Now matches `<!--$text-->` where text
// does not start with `>` or `->`, does not end with `-`,
// and does not contain `--`.
const input = 'a <!--<->>-<!-->';
final html = markdownToHtml(input);
expect(html, '<p>$input</p>\n');
});
}