|
| 1 | +<!DOCTYPE html> |
| 2 | +<html> |
| 3 | +<head> |
| 4 | + <meta charset="utf-8"/> |
| 5 | + <title>MultiMarkdown v6 Quick Start Guide</title> |
| 6 | + <meta name="author" content="Fletcher T. Penney"/> |
| 7 | + <meta name="version" content="6.0-b"/> |
| 8 | +</head> |
| 9 | +<body> |
| 10 | + |
| 11 | +<div class="TOC"> |
| 12 | + |
| 13 | +<ul> |
| 14 | +<li><a href="#introduction">Introduction </a></li> |
| 15 | +<li><a href="#performance">Performance </a></li> |
| 16 | +<li><a href="#parsetree">Parse Tree </a></li> |
| 17 | +<li><a href="#features">Features </a> |
| 18 | +<ul> |
| 19 | +<li><a href="#abbreviationsoracronyms">Abbreviations (Or Acronyms) </a></li> |
| 20 | +<li><a href="#citations">Citations </a></li> |
| 21 | +<li><a href="#criticmarkup">CriticMarkup </a></li> |
| 22 | +<li><a href="#emphandstrong">Emph and Strong </a></li> |
| 23 | +<li><a href="#fencedcodeblocks">Fenced Code Blocks </a></li> |
| 24 | +<li><a href="#glossaryterms">Glossary Terms </a></li> |
| 25 | +<li><a href="#internationalization">Internationalization </a></li> |
| 26 | +<li><a href="#metadata">Metadata </a></li> |
| 27 | +<li><a href="#tableofcontents">Table of Contents </a></li> |
| 28 | +</ul> |
| 29 | +</li> |
| 30 | +<li><a href="#futuresteps">Future Steps </a></li> |
| 31 | +</ul> |
| 32 | +</div> |
| 33 | + |
| 34 | +<h3 id="introduction">Introduction </h3> |
| 35 | + |
| 36 | +<p>Version: 6.0-b</p> |
| 37 | + |
| 38 | +<p>This document serves as a description of MultiMarkdown (<abbr title="MultiMarkdown">MMD</abbr>) v6, as well as a sample |
| 39 | +document to demonstrate the various features. Specifically, differences from |
| 40 | +<abbr title="MultiMarkdown">MMD</abbr> v5 will be pointed out.</p> |
| 41 | + |
| 42 | +<h3 id="performance">Performance </h3> |
| 43 | + |
| 44 | +<p>A big motivating factor leading to the development of <abbr title="MultiMarkdown">MMD</abbr> v6 was |
| 45 | +performance. When <abbr title="MultiMarkdown">MMD</abbr> first migrated from Perl to C (based on |
| 46 | +<a href="https://github.com/jgm/peg-markdown">peg-markdown</a>), it was among the fastest |
| 47 | +Markdown parsers available. That was many years ago, and the “competition” |
| 48 | +has made a great deal of progress since that time.</p> |
| 49 | + |
| 50 | +<p>When developing <abbr title="MultiMarkdown">MMD</abbr> v6, one of my goals was to keep <abbr title="MultiMarkdown">MMD</abbr> at least in the |
| 51 | +ballpark of the fastest processors. Of course, being <em>the</em> fastest would be |
| 52 | +fantastic, but I was more concerned with ensuring that the code was easily |
| 53 | +understood, and easily updated with new features in the future.</p> |
| 54 | + |
| 55 | +<p><abbr title="MultiMarkdown">MMD</abbr> v3 – v5 used a <a href="#gn:1" id="gnref:1" title="see glossary" class="glossary">PEG</a> to handle the parsing. This made it easy to |
| 56 | +understand the relationship between the <abbr title="MultiMarkdown">MMD</abbr> grammar and the parsing code, |
| 57 | +since they were one and the same. However, the parsing code generated by |
| 58 | +the parsers was not particularly fast, and was prone to troublesome edge |
| 59 | +cases with terrible performance characteristics.</p> |
| 60 | + |
| 61 | +<p>The first step in <abbr title="MultiMarkdown">MMD</abbr> v6 parsing is to break the source text into a series |
| 62 | +of tokens, which may consist of plain text, whitespace, or special characters |
| 63 | +such as ‘*’, ‘[’, etc. This chain of tokens is then used to perform the |
| 64 | +actual parsing.</p> |
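<p>As a rough illustration of the tokenizing step (a Python sketch, not the C lexer that <abbr title="MultiMarkdown">MMD</abbr> v6 actually uses), the text can be split into whitespace, special-character, and plain-text tokens that each record an offset and a length. The <code>Token</code> type, the <code>lex()</code> function, and the short <code>SPECIAL</code> character set are hypothetical names chosen for this example.</p>

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str    # "whitespace", "special", or "text"
    start: int   # offset of the token in the source string
    length: int  # number of characters the token covers


# Hypothetical set of special characters; a real lexer recognizes many more.
SPECIAL = "*_[]()`<>!#"

TOKEN_RE = re.compile(
    r"(?P<whitespace>\s+)"
    r"|(?P<special>[" + re.escape(SPECIAL) + r"])"
    r"|(?P<text>[^\s" + re.escape(SPECIAL) + r"]+)"
)


def lex(source: str) -> List[Token]:
    """Split source into a flat chain of tokens covering every character."""
    return [
        Token(m.lastgroup, m.start(), m.end() - m.start())
        for m in TOKEN_RE.finditer(source)
    ]
```

<p>Because every character belongs to exactly one token, the lengths in the chain add up to the length of the source, and later stages can work on the chain without rescanning the raw text.</p>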
| 65 | + |
| 66 | +<p><abbr title="MultiMarkdown">MMD</abbr> v6 divides the parsing into two separate phases, which actually fits |
| 67 | +more with Markdown’s design philosophically.</p> |
| 68 | + |
| 69 | +<ol> |
| 70 | +<li><p>Block parsing consists of identifying the “type” of each line of the |
| 71 | +source text, and grouping the lines into blocks (e.g. paragraphs, lists, |
| 72 | +blockquotes, etc.). Some blocks are a single line (e.g. ATX headers), and |
| 73 | +others can be many lines long. The block parsing in <abbr title="MultiMarkdown">MMD</abbr> v6 is handled |
| 74 | +by a parser generated by <a href="http://www.hwaci.com/sw/lemon/">lemon</a>. This |
| 75 | +parser allows the block structure to be more readily understood by |
| 76 | +non-programmers, but the generated parser is still fast.</p></li> |
| 77 | +<li><p>Span parsing consists of identifying Markdown/<abbr title="MultiMarkdown">MMD</abbr> structures that occur |
| 78 | +inside of blocks, such as links, images, strong, emph, etc. Most of these |
| 79 | +structures require matching pairs of tokens to specify where the span starts |
| 80 | +and where it ends. Most of these spans allow arbitrary levels of nesting as |
| 81 | +well. This made parsing them correctly in the PEG-based code difficult and |
| 82 | +slow. <abbr title="MultiMarkdown">MMD</abbr> v6 uses a different approach that is accurate and has good |
| 83 | +performance characteristics even with edge cases. Basically, it keeps a stack |
| 84 | +of each “opening” token as it steps through the token chain. When a “closing” |
| 85 | +token is found, it is paired with the most recent appropriate opener on the |
| 86 | +stack. Any tokens in between the opener and closer are removed, as they are |
| 87 | +not able to be matched any more. To avoid unnecessary searches for |
| 88 | +non-existent openers, the parser keeps track of which opening tokens have been |
| 89 | +discovered. This allows the parser to continue moving forwards without having |
| 90 | +to go backwards and re-parse any previously visited tokens.</p></li> |
| 91 | +</ol> |
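<p>The stack-based pairing described for span parsing can be sketched roughly as follows. This is a simplified Python illustration, not the actual <abbr title="MultiMarkdown">MMD</abbr> v6 code: the <code>PAIRS</code> table and the <code>pair_spans()</code> function are hypothetical, tokens are plain strings, and a symmetric delimiter such as <code>*</code> is assumed to close a span whenever a matching opener is already on the stack.</p>

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical delimiter table: closer -> the opener it pairs with.
# "*" and "`" can both open and close; "[" is closed by "]".
PAIRS = {"*": "*", "`": "`", "]": "["}
OPENERS = set(PAIRS.values())


def pair_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (opener_index, closer_index) pairs found in one forward pass."""
    stack: List[int] = []   # indices of openers that are still unmatched
    seen = Counter()        # how many of each opener are on the stack
    pairs = []

    for i, tok in enumerate(tokens):
        opener = PAIRS.get(tok)
        if opener is not None and seen[opener] > 0:
            # Closer with a known opener: pop back to the most recent one.
            # Openers skipped on the way can no longer be matched.
            while True:
                j = stack.pop()
                seen[tokens[j]] -= 1
                if tokens[j] == opener:
                    break
            pairs.append((j, i))
        elif tok in OPENERS:
            stack.append(i)
            seen[tok] += 1
        # Otherwise: plain text, or a closer whose opener was never seen,
        # so the stack does not need to be searched at all.

    return pairs
```

<p>Each opener is pushed once and popped at most once, so the pass stays linear in the number of tokens even for inputs with many unmatched delimiters.</p>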
| 92 | + |
| 93 | +<p>The result of this redesigned <abbr title="MultiMarkdown">MMD</abbr> parser is that it can parse short |
| 94 | +documents more quickly than <a href="http://commonmark.org/">CommonMark</a>, and takes |
| 95 | +only 15% – 20% longer to parse long documents. I have not delved too deeply |
| 96 | +into this, but I presume that CommonMark has a bit more “set-up” time that |
| 97 | +becomes expensive when parsing a short document (e.g. a paragraph or two). But |
| 98 | +this cost becomes negligible when parsing longer documents (e.g. file sizes of |
| 99 | +1 MB). So depending on your use case, CommonMark may well be faster than |
| 100 | +<abbr title="MultiMarkdown">MMD</abbr>, but we’re talking about splitting hairs here…. Recent comparisons |
| 101 | +show <abbr title="MultiMarkdown">MMD</abbr> v6 taking approximately 4.37 seconds to parse a 108 MB file |
| 102 | +(approximately 24.8 MB/second), and CommonMark took 3.72 seconds for the same |
| 103 | +file (29.2 MB/second). For comparison, <abbr title="MultiMarkdown">MMD</abbr> v5.4 took approximately 94 |
| 104 | +seconds for the same file (1.15 MB/second).</p> |
| 105 | + |
| 106 | +<p>For a more realistic file of approximately 28 KB (the source of the Markdown Syntax |
| 107 | +web page), both <abbr title="MultiMarkdown">MMD</abbr> and CommonMark parse it too quickly to accurately |
| 108 | +measure. In fact, it requires a file consisting of the original file copied |
| 109 | +32 times over (0.85 MB) before <code>/usr/bin/env time</code> reports a time over the |
| 110 | +minimum threshold of 0.01 seconds for either program.</p> |
| 111 | + |
| 112 | +<p>There is still potentially room for additional optimization in <abbr title="MultiMarkdown">MMD</abbr>. |
| 113 | +However, even if I can’t close the performance gap with CommonMark on longer |
| 114 | +files, the additional features of <abbr title="MultiMarkdown">MMD</abbr> compared with Markdown in addition to |
| 115 | +the increased legibility of the source code of <abbr title="MultiMarkdown">MMD</abbr> (in my biased opinion |
| 116 | +anyway) make this project worthwhile.</p> |
| 117 | + |
| 118 | +<h3 id="parsetree">Parse Tree </h3> |
| 119 | + |
| 120 | +<p><abbr title="MultiMarkdown">MMD</abbr> v6 performs its parsing in the following steps:</p> |
| 121 | + |
| 122 | +<ol> |
| 123 | +<li><p>Start with a null-terminated string of source text (C style string)</p></li> |
| 124 | +<li><p>Lex string into token chain</p></li> |
| 125 | +<li><p>Parse token chain into blocks</p></li> |
| 126 | +<li><p>Parse tokens within each block into span level structures (e.g. strong, |
| 127 | +emph, etc.)</p></li> |
| 128 | +<li><p>Export the token tree into the desired output format (e.g. HTML, LaTeX, |
| 129 | +etc.) and return the resulting C style string</p> |
| 130 | + |
| 131 | +<p><strong>OR</strong></p></li> |
| 132 | +<li><p>Use the resulting token tree for your own purposes.</p></li> |
| 133 | +</ol> |
| 134 | + |
| 135 | +<p>The token tree (<a href="#gn:2" id="gnref:2" title="see glossary" class="glossary">AST</a>) includes starting offsets and length of each token, |
| 136 | +allowing you to use <abbr title="MultiMarkdown">MMD</abbr> as part of a syntax highlighter. <abbr title="MultiMarkdown">MMD</abbr> v5 did not |
| 137 | +have this functionality in the public version, in part because the PEG parsers |
| 138 | +used did not provide reliable offset positions, requiring a great deal of |
| 139 | +effort when I adapted MMD for use in <a href="http://multimarkdown.com/">MultiMarkdown |
| 140 | +Composer</a>.</p> |
| 141 | + |
| 142 | +<p>These steps are managed using the <code>mmd_engine</code> “object”. An individual |
| 143 | +<code>mmd_engine</code> cannot be used by multiple threads simultaneously, so if |
| 144 | +libMultiMarkdown is to be used in a multithreaded program, a separate |
| 145 | +<code>mmd_engine</code> should be created for each thread. Alternatively, just use the |
| 146 | +slightly more abstracted <code>mmd_convert_string()</code> function that handles creating |
| 147 | +and destroying the <code>mmd_engine</code> automatically.</p> |
| 148 | + |
| 149 | +<h3 id="features">Features </h3> |
| 150 | + |
| 151 | +<h4 id="abbreviationsoracronyms">Abbreviations (Or Acronyms) </h4> |
| 152 | + |
| 153 | +<p>This file includes the use of <abbr title="MultiMarkdown">MMD</abbr> as an abbreviation for MultiMarkdown. The |
| 154 | +abbreviation will be expanded on the first use, and the shortened form will be |
| 155 | +used on subsequent occurrences.</p> |
| 156 | + |
| 157 | +<p>Abbreviations can be specified using inline or reference syntax. The inline |
| 158 | +variant requires that the abbreviation be wrapped in parentheses and |
| 159 | +immediately follow the <code>></code>.</p> |
| 160 | + |
| 161 | +<pre><code>[>MMD] is an abbreviation. So is [>(MD) Markdown]. |
| 162 | + |
| 163 | +[>MMD]: MultiMarkdown |
| 164 | +</code></pre> |
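<p>To make the two abbreviation forms concrete, the following Python sketch collects abbreviations from a source string with regular expressions. It only illustrates the syntax shown above and is not how <abbr title="MultiMarkdown">MMD</abbr> itself parses them; <code>collect_abbreviations()</code> and both patterns are hypothetical, and the sketch assumes that a reference definition takes precedence over an inline one.</p>

```python
import re
from typing import Dict

# Inline form: [>(MD) Markdown]
INLINE_RE = re.compile(r"\[>\((?P<abbr>[^)]+)\)\s*(?P<expansion>[^\]]+)\]")

# Reference definition on its own line: [>MMD]: MultiMarkdown
REFERENCE_RE = re.compile(
    r"^\[>(?P<abbr>[^\]]+)\]:\s*(?P<expansion>.+)$", re.MULTILINE
)


def collect_abbreviations(source: str) -> Dict[str, str]:
    """Map each abbreviation found in source to its expansion."""
    found = {
        m["abbr"]: m["expansion"].strip()
        for m in REFERENCE_RE.finditer(source)
    }
    for m in INLINE_RE.finditer(source):
        # Assumption: keep the reference definition if both forms exist.
        found.setdefault(m["abbr"], m["expansion"].strip())
    return found
```

<p>Run on the sample above, this maps <code>MMD</code> to <code>MultiMarkdown</code> and <code>MD</code> to <code>Markdown</code>.</p>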
| 165 | + |
| 166 | +<h4 id="citations">Citations </h4> |
| 167 | + |
| 168 | +<p>Citations can be specified using an inline syntax, just like inline footnotes.</p> |
| 169 | + |
| 170 | +<h4 id="criticmarkup">CriticMarkup </h4> |
| 171 | + |
| 172 | +<p><abbr title="MultiMarkdown">MMD</abbr> v6 has improved support for <a href="http://criticmarkup.com/">CriticMarkup</a>, both in terms of parsing, and |
| 173 | +in terms of support for each output format. You can <ins>insert text</ins>, |
| 174 | +<del>delete text</del>, substitute <del>one thing</del><ins>for another</ins>, <mark>highlight text</mark>, |
| 175 | +and <span class="critic comment">leave comments</span> in the text.</p> |
| 176 | + |
| 177 | +<h4 id="emphandstrong">Emph and Strong </h4> |
| 178 | + |
| 179 | +<p>The basics of emphasis and strong emphasis are unchanged, but the parsing |
| 180 | +engine has been improved to be more accurate, particularly in various edge |
| 181 | +cases where proper parsing can be difficult.</p> |
| 182 | + |
| 183 | +<h4 id="fencedcodeblocks">Fenced Code Blocks </h4> |
| 184 | + |
| 185 | +<p>Fenced code blocks are fundamentally the same as <abbr title="MultiMarkdown">MMD</abbr> v5, except:</p> |
| 186 | + |
| 187 | +<ol> |
| 188 | +<li><p>The leading and trailing fences can be 3, 4, or 5 backticks in length. That |
| 189 | +should be sufficient to account for complex documents without requiring a more |
| 190 | +complex parser.</p></li> |
| 191 | +<li><p>If there is no trailing fence, then everything after the leading fence is |
| 192 | +considered to be part of the code block.</p></li> |
| 193 | +</ol> |
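<p>The two fence rules can be sketched as a small line-by-line scanner. This is a hedged Python illustration rather than the real block parser: <code>extract_code_blocks()</code> is a hypothetical helper, and the sketch assumes that a block is closed only by a fence of the same length as the one that opened it.</p>

```python
import re
from typing import List

# A fence line: 3 to 5 backticks, optionally followed by an info string.
FENCE_RE = re.compile(r"^(`{3,5})(?!`)\s*([^`]*)$")


def extract_code_blocks(source: str) -> List[str]:
    """Collect the contents of fenced code blocks in source."""
    blocks: List[str] = []
    current = None   # lines of the open block, or None outside a block
    fence = ""       # the backtick run that opened the current block

    for line in source.splitlines():
        if current is None:
            match = FENCE_RE.match(line)
            if match:
                fence, current = match.group(1), []
        elif line.strip() == fence:
            # Assumption: a block closes on a fence of the same length.
            blocks.append("\n".join(current))
            current = None
        else:
            current.append(line)

    if current is not None:
        # No trailing fence: the rest of the document is code.
        blocks.append("\n".join(current))
    return blocks
```

<p>A run of six or more backticks is not treated as a fence here, which mirrors the stated limit of 3, 4, or 5 backticks.</p>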
| 194 | + |
| 195 | +<h4 id="glossaryterms">Glossary Terms </h4> |
| 196 | + |
| 197 | +<p>If there are terms in your document you wish to define in a <a href="#gn:3" id="gnref:3" title="see glossary" class="glossary">glossary</a> at |
| 198 | +the end of your document, you can define them using the glossary syntax.</p> |
| 199 | + |
| 200 | +<p>Glossary terms can be specified using inline or reference syntax. The inline |
| 201 | +variant requires that the term be wrapped in parentheses and |
| 202 | +immediately follow the <code>?</code>.</p> |
| 203 | + |
| 204 | +<pre><code>[?(glossary) The glossary collects information about important |
| 205 | +terms used in your document] is a glossary term. |
| 206 | + |
| 207 | +[?glossary] is also a glossary term. |
| 208 | + |
| 209 | +[?glossary]: The glossary collects information about important |
| 210 | +terms used in your document |
| 211 | +</code></pre> |
| 212 | + |
| 213 | +<h4 id="internationalization">Internationalization </h4> |
| 214 | + |
| 215 | +<p><abbr title="MultiMarkdown">MMD</abbr> v6 includes support for substituting certain text phrases in other |
| 216 | +languages. This only affects the HTML format.</p> |
| 217 | + |
| 218 | +<h4 id="metadata">Metadata </h4> |
| 219 | + |
| 220 | +<p>Metadata in <abbr title="MultiMarkdown">MMD</abbr> v6 includes new support for LaTeX – the <code>latex config</code> key |
| 221 | +allows you to automatically set up multiple <code>latex include</code> files at once. |
| 222 | +The default setups that I use would typically consist of one LaTeX file to be |
| 223 | +included at the top of the file, one to be included right at the beginning of |
| 224 | +the document, and one to be included at the end of the document. If you want |
| 225 | +to specify the latex files separately, you can use <code>latex leader</code>, <code>latex |
| 226 | +begin</code>, and <code>latex footer</code>.</p> |
| 227 | + |
| 228 | +<h4 id="tableofcontents">Table of Contents </h4> |
| 229 | + |
| 230 | +<p>By placing <code>{{TOC}}</code> in your document, you can insert an automatically |
| 231 | +generated Table of Contents in your document. As of <abbr title="MultiMarkdown">MMD</abbr> v6, the native |
| 232 | +Table of Contents functionality is used when exporting to LaTeX or |
| 233 | +OpenDocument formats.</p> |
| 234 | + |
| 235 | +<h3 id="futuresteps">Future Steps </h3> |
| 236 | + |
| 237 | +<p>Some features I plan to implement at some point:</p> |
| 238 | + |
| 239 | +<ol> |
| 240 | +<li><p><abbr title="MultiMarkdown">MMD</abbr> v5 used to automatically identify abbreviated terms throughout the |
| 241 | +document and substitute them automatically. I plan to reimplement this |
| 242 | +functionality, but will probably improve upon it to include glossary terms, |
| 243 | +and possibly even support for indexing documents in LaTeX (and possibly |
| 244 | +OpenOffice).</p></li> |
| 245 | +<li><p>OPML export support is not available in v6. I plan on adding improved |
| 246 | +support for this at some point. I was hoping to be able to re-use the |
| 247 | +existing v6 parser but it might be simpler to use the approach from v5 and |
| 248 | +earlier, which was to have a separate parser tuned to only identify headers |
| 249 | +and “stuff between headers”.</p></li> |
| 250 | +<li><p>Improved EPUB support. Currently, EPUB support is provided by a separate |
| 251 | +<a href="https://github.com/fletcher/MMD-ePub">tool</a>. At some point, I would like to |
| 252 | +better integrate this into <abbr title="MultiMarkdown">MMD</abbr> itself.</p></li> |
| 253 | +</ol> |
| 254 | + |
| 255 | +<div class="glossary"> |
| 256 | +<hr /> |
| 257 | +<ol> |
| 258 | + |
| 259 | +<li id="gn:1"> |
| 260 | +PEG: <p>Parsing Expression Grammar <a href="https://en.wikipedia.org/wiki/Parsing_expression_grammar">https://en.wikipedia.org/wiki/Parsing_expression_grammar</a> <a href="#gnref:1" title="return to body" class="reverseglossary"> ↩</a></p> |
| 261 | +</li> |
| 262 | + |
| 263 | +<li id="gn:2"> |
| 264 | +AST: <p>Abstract Syntax Tree <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">https://en.wikipedia.org/wiki/Abstract_syntax_tree</a> <a href="#gnref:2" title="return to body" class="reverseglossary"> ↩</a></p> |
| 265 | +</li> |
| 266 | + |
| 267 | +<li id="gn:3"> |
| 268 | +glossary: <p>The |
| 269 | +glossary collects information about important terms used in your document <a href="#gnref:3" title="return to body" class="reverseglossary"> ↩</a></p> |
| 270 | +</li> |
| 271 | + |
| 272 | +</ol> |
| 273 | +</div> |
| 274 | + |
| 275 | +</body> |
| 276 | +</html> |
| 277 | + |