
Commit 288f1cb

Tokenizers lex their own child tokens (#2124)

BREAKING CHANGES:

- Tokenizers will create their own child tokens with `this.lexer.inline(text, tokens)`. The `inline` function queues token creation until after all block tokens are rendered.
- The `nptable` tokenizer is removed and merged into the `table` tokenizer.
- An extension tokenizer's `this` object will include the `lexer` as a property; `this.inlineTokens` becomes `this.lexer.inline`.
- An extension renderer's `this` object will include the `parser` as a property; `this.parseInline` becomes `this.parser.parseInline`.
- The `tag` and `inlineText` tokenizer function signatures have changed.
1 parent 20bda6e commit 288f1cb
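For extension authors, the breaking changes amount to a small mechanical migration. The sketch below is illustrative only (the `shout` extension and `fakeLexer` are hypothetical, not part of marked): child inline tokens are no longer built eagerly with `this.inlineTokens(...)`; instead an empty `tokens` array is handed to `this.lexer.inline(text, tokens)`, which fills it later, once all block tokens exist.

```javascript
// Before #2124: the tokenizer built its child tokens immediately.
const beforeTokenizer = {
  tokenizer(src) {
    const match = /^!!([^!\n]+)!!/.exec(src);
    if (match) {
      return {
        type: 'shout',
        raw: match[0],
        text: match[1],
        tokens: this.inlineTokens(match[1]) // ran inline lexing right away
      };
    }
  }
};

// After #2124: the tokenizer hands an empty array to this.lexer.inline(),
// which queues the work until the block pass is finished.
const afterTokenizer = {
  tokenizer(src) {
    const match = /^!!([^!\n]+)!!/.exec(src);
    if (match) {
      const token = { type: 'shout', raw: match[0], text: match[1], tokens: [] };
      this.lexer.inline(token.text, token.tokens); // deferred; returns nothing
      return token;
    }
  }
};

// Minimal stand-in for the new Lexer queue, so the sketch is runnable:
const fakeLexer = {
  inlineQueue: [],
  inline(src, tokens) { this.inlineQueue.push({ src, tokens }); }
};

const token = afterTokenizer.tokenizer.call({ lexer: fakeLexer }, '!!hello!!');
console.log(token.tokens.length); // 0 — nothing yet; filled when the queue drains
fakeLexer.inlineQueue.forEach(({ src, tokens }) =>
  tokens.push({ type: 'text', raw: src, text: src }));
console.log(token.tokens[0].text); // 'hello'
```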

File tree

9 files changed, +206 −302 lines


docs/USING_PRO.md

Lines changed: 25 additions & 19 deletions
@@ -226,7 +226,7 @@ console.log(marked('$ latex code $\n\n` other code `'));
 ### Inline level tokenizer methods
 
 - <code>**escape**(*string* src)</code>
-- <code>**tag**(*string* src, *bool* inLink, *bool* inRawBlock)</code>
+- <code>**tag**(*string* src)</code>
 - <code>**link**(*string* src)</code>
 - <code>**reflink**(*string* src, *object* links)</code>
 - <code>**emStrong**(*string* src, *string* maskedSrc, *string* prevChar)</code>
@@ -235,7 +235,7 @@ console.log(marked('$ latex code $\n\n` other code `'));
 - <code>**del**(*string* src)</code>
 - <code>**autolink**(*string* src, *function* mangle)</code>
 - <code>**url**(*string* src, *function* mangle)</code>
-- <code>**inlineText**(*string* src, *bool* inRawBlock, *function* smartypants)</code>
+- <code>**inlineText**(*string* src, *function* smartypants)</code>
 
 `mangle` is a method that changes text to HTML character references:
 
@@ -331,11 +331,15 @@ The returned token can also contain any other custom parameters of your choice t
 The tokenizer function has access to the lexer in the `this` object, which can be used if any internal section of the string needs to be parsed further, such as in handling any inline syntax on the text within a block token. The key functions that may be useful include:
 
 <dl>
-<dt><code><strong>this.blockTokens</strong>(<i>string</i> text)</code></dt>
-<dd>Runs the block tokenizer functions (including any extensions) on the provided text, and returns an array containing a nested tree of tokens.</dd>
+<dt><code><strong>this.lexer.blockTokens</strong>(<i>string</i> text, <i>array</i> tokens)</code></dt>
+<dd>This runs the block tokenizer functions (including any block-level extensions) on the provided text, and appends any resulting tokens onto the <code>tokens</code> array. The <code>tokens</code> array is also returned by the function. You might use this, for example, if your extension creates a "container"-type token (such as a blockquote) that can potentially include other block-level tokens inside.</dd>
 
-<dt><code><strong>this.inlineTokens</strong>(<i>string</i> text)</code></dt>
-<dd>Runs the inline tokenizer functions (including any extensions) on the provided text, and returns an array containing a nested tree of tokens. This can be used to generate the <code>tokens</code> parameter.</dd>
+<dl>
+<dt><code><strong>this.lexer.inline</strong>(<i>string</i> text, <i>array</i> tokens)</code></dt>
+<dd>Parsing of inline-level tokens only occurs after all block-level tokens have been generated. This function adds <code>text</code> and <code>tokens</code> to a queue to be processed using inline-level tokenizers (including any inline-level extensions) at that later step. Tokens will be generated using the provided <code>text</code>, and any resulting tokens will be appended to the <code>tokens</code> array. Note that this function does **NOT** return anything since the inline processing cannot happen until the block-level processing is complete.</dd>
+
+<dt><code><strong>this.lexer.inlineTokens</strong>(<i>string</i> text, <i>array</i> tokens)</code></dt>
+<dd>Sometimes an inline-level token contains further nested inline tokens (such as a <pre><code>**strong**</code></pre> token inside of a <pre><code>### Heading</code></pre>). This runs the inline tokenizer functions (including any inline-level extensions) on the provided text, and appends any resulting tokens onto the <code>tokens</code> array. The <code>tokens</code> array is also returned by the function.</dd>
 </dl>
 
 <dt><code><strong>renderer</strong>(<i>object</i> token)</code></dt>
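The difference between the queued and immediate calls documented above can be sketched with a toy lexer (this is not the real marked `Lexer`; the token shapes and the lexing itself are simplified stand-ins): `inline` only queues work and returns nothing, while `inlineTokens` lexes right away and returns the array.

```javascript
// Toy model of the new deferred-inline behaviour.
class ToyLexer {
  constructor() {
    this.inlineQueue = [];
  }

  // this.lexer.inline: queue only, returns nothing
  inline(src, tokens) {
    this.inlineQueue.push({ src, tokens });
  }

  // this.lexer.inlineTokens: lex immediately and return the array
  inlineTokens(src, tokens = []) {
    tokens.push({ type: 'text', raw: src, text: src }); // stand-in for real inline lexing
    return tokens;
  }

  // what lex() does at the end, after all block tokens exist
  drainQueue() {
    let next;
    while (next = this.inlineQueue.shift()) {
      this.inlineTokens(next.src, next.tokens);
    }
  }
}

const lexer = new ToyLexer();
const blockToken = { type: 'custom', tokens: [] };
lexer.inline('some *text*', blockToken.tokens);
console.log(blockToken.tokens.length);         // 0 — still queued
lexer.drainQueue();
console.log(blockToken.tokens.length);         // 1 — filled after the block pass
console.log(lexer.inlineTokens('now').length); // 1 — lexed immediately
```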
@@ -344,11 +348,11 @@ The tokenizer function has access to the lexer in the `this` object, which can b
 The renderer function has access to the parser in the `this` object, which can be used if any part of the token needs needs to be parsed further, such as any child tokens. The key functions that may be useful include:
 
 <dl>
-<dt><code><strong>this.parse</strong>(<i>array</i> tokens)</code></dt>
-<dd>Runs the block renderer functions (including any extensions) on the provided array of tokens, and returns the resulting HTML string output.</dd>
+<dt><code><strong>this.parser.parse</strong>(<i>array</i> tokens)</code></dt>
+<dd>Runs the block renderer functions (including any extensions) on the provided array of tokens, and returns the resulting HTML string output. This is used to generate the HTML from any child block-level tokens, for example if your extension is a "container"-type token (such as a blockquote) that can potentially include other block-level tokens inside.</dd>
 
-<dt><code><strong>this.parseInline</strong>(<i>array</i> tokens)</code></dt>
-<dd>Runs the inline renderer functions (including any extensions) on the provided array of tokens, and returns the resulting HTML string output. This could be used to generate text from any child tokens, for example.</dd>
+<dt><code><strong>this.parser.parseInline</strong>(<i>array</i> tokens)</code></dt>
+<dd>Runs the inline renderer functions (including any extensions) on the provided array of tokens, and returns the resulting HTML string output. This is used to generate the HTML from any child inline-level tokens.</dd>
 </dl>
 
 </dd>
@@ -371,16 +375,18 @@ const descriptionlist = {
     const rule = /^(?::[^:\n]+:[^:\n]*(?:\n|$))+/; // Regex for the complete token
     const match = rule.exec(src);
     if (match) {
-      return { // Token to generate
+      const token = { // Token to generate
         type: 'descriptionList', // Should match "name" above
         raw: match[0], // Text to consume from the source
         text: match[0].trim(), // Additional custom properties
-        tokens: this.inlineTokens(match[0].trim()) // inlineTokens to process **bold**, *italics*, etc.
+        tokens: [] // Array where child inline tokens will be generated
       };
+      this.lexer.inline(token.text, token.tokens); // Queue this data to be processed for inline tokens
+      return token;
     }
   },
   renderer(token) {
-    return `<dl>${this.parseInline(token.tokens)}\n</dl>`; // parseInline to turn child tokens into HTML
+    return `<dl>${this.parser.parseInline(token.tokens)}\n</dl>`; // parseInline to turn child tokens into HTML
   }
 };
 
@@ -392,16 +398,16 @@ const description = {
     const rule = /^:([^:\n]+):([^:\n]*)(?:\n|$)/; // Regex for the complete token
     const match = rule.exec(src);
     if (match) {
-      return { // Token to generate
-        type: 'description', // Should match "name" above
-        raw: match[0], // Text to consume from the source
-        dt: this.inlineTokens(match[1].trim()), // Additional custom properties
-        dd: this.inlineTokens(match[2].trim())
+      return { // Token to generate
+        type: 'description', // Should match "name" above
+        raw: match[0], // Text to consume from the source
+        dt: this.lexer.inlineTokens(match[1].trim()), // Additional custom properties, including
+        dd: this.lexer.inlineTokens(match[2].trim()) // any further-nested inline tokens
       };
     }
   },
   renderer(token) {
-    return `\n<dt>${this.parseInline(token.dt)}</dt><dd>${this.parseInline(token.dd)}</dd>`;
+    return `\n<dt>${this.parser.parseInline(token.dt)}</dt><dd>${this.parser.parseInline(token.dd)}</dd>`;
   },
   childTokens: ['dt', 'dd'], // Any child tokens to be visited by walkTokens
   walkTokens(token) { // Post-processing on the completed token tree

src/Lexer.js

Lines changed: 34 additions & 100 deletions
@@ -55,6 +55,13 @@ module.exports = class Lexer {
     this.options.tokenizer = this.options.tokenizer || new Tokenizer();
     this.tokenizer = this.options.tokenizer;
     this.tokenizer.options = this.options;
+    this.tokenizer.lexer = this;
+    this.inlineQueue = [];
+    this.state = {
+      inLink: false,
+      inRawBlock: false,
+      top: true
+    };
 
     const rules = {
       block: block.normal,
@@ -109,27 +116,30 @@ module.exports = class Lexer {
       .replace(/\r\n|\r/g, '\n')
       .replace(/\t/g, '    ');
 
-    this.blockTokens(src, this.tokens, true);
+    this.blockTokens(src, this.tokens);
 
-    this.inline(this.tokens);
+    let next;
+    while (next = this.inlineQueue.shift()) {
+      this.inlineTokens(next.src, next.tokens);
+    }
 
     return this.tokens;
   }
 
   /**
   * Lexing
   */
-  blockTokens(src, tokens = [], top = true) {
+  blockTokens(src, tokens = []) {
     if (this.options.pedantic) {
       src = src.replace(/^ +$/gm, '');
     }
-    let token, i, l, lastToken, cutSrc, lastParagraphClipped;
+    let token, lastToken, cutSrc, lastParagraphClipped;
 
     while (src) {
       if (this.options.extensions
         && this.options.extensions.block
         && this.options.extensions.block.some((extTokenizer) => {
-          if (token = extTokenizer.call(this, src, tokens)) {
+          if (token = extTokenizer.call({ lexer: this }, src, tokens)) {
            src = src.substring(token.raw.length);
            tokens.push(token);
            return true;
@@ -156,6 +166,8 @@ module.exports = class Lexer {
         if (lastToken && lastToken.type === 'paragraph') {
           lastToken.raw += '\n' + token.raw;
           lastToken.text += '\n' + token.text;
+          this.inlineQueue.pop();
+          this.inlineQueue[this.inlineQueue.length - 1].src = lastToken.text;
         } else {
           tokens.push(token);
         }
@@ -176,13 +188,6 @@ module.exports = class Lexer {
         continue;
       }
 
-      // table no leading pipe (gfm)
-      if (token = this.tokenizer.nptable(src)) {
-        src = src.substring(token.raw.length);
-        tokens.push(token);
-        continue;
-      }
-
       // hr
       if (token = this.tokenizer.hr(src)) {
         src = src.substring(token.raw.length);
@@ -193,18 +198,13 @@ module.exports = class Lexer {
       // blockquote
       if (token = this.tokenizer.blockquote(src)) {
         src = src.substring(token.raw.length);
-        token.tokens = this.blockTokens(token.text, [], top);
         tokens.push(token);
         continue;
       }
 
       // list
       if (token = this.tokenizer.list(src)) {
         src = src.substring(token.raw.length);
-        l = token.items.length;
-        for (i = 0; i < l; i++) {
-          token.items[i].tokens = this.blockTokens(token.items[i].text, [], false);
-        }
         tokens.push(token);
         continue;
       }
@@ -217,7 +217,7 @@ module.exports = class Lexer {
       }
 
       // def
-      if (top && (token = this.tokenizer.def(src))) {
+      if (this.state.top && (token = this.tokenizer.def(src))) {
         src = src.substring(token.raw.length);
         if (!this.tokens.links[token.tag]) {
           this.tokens.links[token.tag] = {
@@ -250,18 +250,20 @@ module.exports = class Lexer {
         const tempSrc = src.slice(1);
         let tempStart;
         this.options.extensions.startBlock.forEach(function(getStartIndex) {
-          tempStart = getStartIndex.call(this, tempSrc);
+          tempStart = getStartIndex.call({ lexer: this }, tempSrc);
           if (typeof tempStart === 'number' && tempStart >= 0) { startIndex = Math.min(startIndex, tempStart); }
         });
         if (startIndex < Infinity && startIndex >= 0) {
           cutSrc = src.substring(0, startIndex + 1);
         }
       }
-      if (top && (token = this.tokenizer.paragraph(cutSrc))) {
+      if (this.state.top && (token = this.tokenizer.paragraph(cutSrc))) {
         lastToken = tokens[tokens.length - 1];
         if (lastParagraphClipped && lastToken.type === 'paragraph') {
           lastToken.raw += '\n' + token.raw;
           lastToken.text += '\n' + token.text;
+          this.inlineQueue.pop();
+          this.inlineQueue[this.inlineQueue.length - 1].src = lastToken.text;
         } else {
           tokens.push(token);
         }
@@ -277,6 +279,8 @@ module.exports = class Lexer {
         if (lastToken && lastToken.type === 'text') {
           lastToken.raw += '\n' + token.raw;
           lastToken.text += '\n' + token.text;
+          this.inlineQueue.pop();
+          this.inlineQueue[this.inlineQueue.length - 1].src = lastToken.text;
         } else {
           tokens.push(token);
         }
@@ -294,78 +298,18 @@ module.exports = class Lexer {
       }
     }
 
+    this.state.top = true;
     return tokens;
   }
 
-  inline(tokens) {
-    let i,
-      j,
-      k,
-      l2,
-      row,
-      token;
-
-    const l = tokens.length;
-    for (i = 0; i < l; i++) {
-      token = tokens[i];
-      switch (token.type) {
-        case 'paragraph':
-        case 'text':
-        case 'heading': {
-          token.tokens = [];
-          this.inlineTokens(token.text, token.tokens);
-          break;
-        }
-        case 'table': {
-          token.tokens = {
-            header: [],
-            cells: []
-          };
-
-          // header
-          l2 = token.header.length;
-          for (j = 0; j < l2; j++) {
-            token.tokens.header[j] = [];
-            this.inlineTokens(token.header[j], token.tokens.header[j]);
-          }
-
-          // cells
-          l2 = token.cells.length;
-          for (j = 0; j < l2; j++) {
-            row = token.cells[j];
-            token.tokens.cells[j] = [];
-            for (k = 0; k < row.length; k++) {
-              token.tokens.cells[j][k] = [];
-              this.inlineTokens(row[k], token.tokens.cells[j][k]);
-            }
-          }
-
-          break;
-        }
-        case 'blockquote': {
-          this.inline(token.tokens);
-          break;
-        }
-        case 'list': {
-          l2 = token.items.length;
-          for (j = 0; j < l2; j++) {
-            this.inline(token.items[j].tokens);
-          }
-          break;
-        }
-        default: {
-          // do nothing
-        }
-      }
-    }
-
-    return tokens;
+  inline(src, tokens) {
+    this.inlineQueue.push({ src, tokens });
   }
 
   /**
   * Lexing/Compiling
   */
-  inlineTokens(src, tokens = [], inLink = false, inRawBlock = false) {
+  inlineTokens(src, tokens = []) {
     let token, lastToken, cutSrc;
 
     // String with links masked to avoid interference with em and strong
@@ -404,7 +348,7 @@ module.exports = class Lexer {
       if (this.options.extensions
         && this.options.extensions.inline
         && this.options.extensions.inline.some((extTokenizer) => {
-          if (token = extTokenizer.call(this, src, tokens)) {
+          if (token = extTokenizer.call({ lexer: this }, src, tokens)) {
            src = src.substring(token.raw.length);
            tokens.push(token);
            return true;
@@ -422,10 +366,8 @@ module.exports = class Lexer {
       }
 
       // tag
-      if (token = this.tokenizer.tag(src, inLink, inRawBlock)) {
+      if (token = this.tokenizer.tag(src)) {
         src = src.substring(token.raw.length);
-        inLink = token.inLink;
-        inRawBlock = token.inRawBlock;
         lastToken = tokens[tokens.length - 1];
         if (lastToken && token.type === 'text' && lastToken.type === 'text') {
           lastToken.raw += token.raw;
@@ -439,9 +381,6 @@ module.exports = class Lexer {
       // link
       if (token = this.tokenizer.link(src)) {
         src = src.substring(token.raw.length);
-        if (token.type === 'link') {
-          token.tokens = this.inlineTokens(token.text, [], true, inRawBlock);
-        }
         tokens.push(token);
         continue;
       }
@@ -450,10 +389,7 @@ module.exports = class Lexer {
       if (token = this.tokenizer.reflink(src, this.tokens.links)) {
         src = src.substring(token.raw.length);
         lastToken = tokens[tokens.length - 1];
-        if (token.type === 'link') {
-          token.tokens = this.inlineTokens(token.text, [], true, inRawBlock);
-          tokens.push(token);
-        } else if (lastToken && token.type === 'text' && lastToken.type === 'text') {
+        if (lastToken && token.type === 'text' && lastToken.type === 'text') {
           lastToken.raw += token.raw;
           lastToken.text += token.text;
         } else {
@@ -465,7 +401,6 @@ module.exports = class Lexer {
      // em & strong
      if (token = this.tokenizer.emStrong(src, maskedSrc, prevChar)) {
        src = src.substring(token.raw.length);
-        token.tokens = this.inlineTokens(token.text, [], inLink, inRawBlock);
        tokens.push(token);
        continue;
      }
@@ -487,7 +422,6 @@ module.exports = class Lexer {
      // del (gfm)
      if (token = this.tokenizer.del(src)) {
        src = src.substring(token.raw.length);
-        token.tokens = this.inlineTokens(token.text, [], inLink, inRawBlock);
        tokens.push(token);
        continue;
      }
@@ -500,7 +434,7 @@ module.exports = class Lexer {
       }
 
       // url (gfm)
-      if (!inLink && (token = this.tokenizer.url(src, mangle))) {
+      if (!this.state.inLink && (token = this.tokenizer.url(src, mangle))) {
         src = src.substring(token.raw.length);
         tokens.push(token);
         continue;
@@ -514,14 +448,14 @@ module.exports = class Lexer {
         const tempSrc = src.slice(1);
         let tempStart;
         this.options.extensions.startInline.forEach(function(getStartIndex) {
-          tempStart = getStartIndex.call(this, tempSrc);
+          tempStart = getStartIndex.call({ lexer: this }, tempSrc);
           if (typeof tempStart === 'number' && tempStart >= 0) { startIndex = Math.min(startIndex, tempStart); }
         });
         if (startIndex < Infinity && startIndex >= 0) {
           cutSrc = src.substring(0, startIndex + 1);
         }
       }
-      if (token = this.tokenizer.inlineText(cutSrc, inRawBlock, smartypants)) {
+      if (token = this.tokenizer.inlineText(cutSrc, smartypants)) {
         src = src.substring(token.raw.length);
         if (token.raw.slice(-1) !== '_') { // Track prevChar before string of ____ started
           prevChar = token.raw.slice(-1);
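One practical effect of draining `inlineQueue` only after the block pass (a motivation suggested by the replaced `inline(tokens)` walker, not stated explicitly in the commit message): reference-style links can appear before their definitions, because the link table is fully populated before any inline lexing runs. A toy two-pass sketch, with simplified regexes and token shapes that are not marked's own:

```javascript
// Toy two-pass lexer: block pass collects definitions and queues inline work;
// inline pass runs only once every definition has been seen.
const links = {};
const inlineQueue = [];

function blockPass(lines) {
  const tokens = [];
  for (const line of lines) {
    const def = /^\[([^\]]+)\]:\s*(\S+)/.exec(line);
    if (def) {
      links[def[1]] = { href: def[2] }; // collect the definition, emit no token
      continue;
    }
    const token = { type: 'paragraph', text: line, tokens: [] };
    inlineQueue.push({ src: line, tokens: token.tokens }); // defer inline lexing
    tokens.push(token);
  }
  return tokens;
}

function inlinePass() {
  let next;
  while (next = inlineQueue.shift()) {
    const ref = /\[([^\]]+)\]/.exec(next.src);
    if (ref && links[ref[1]]) {
      next.tokens.push({ type: 'link', href: links[ref[1]].href, text: ref[1] });
    } else {
      next.tokens.push({ type: 'text', text: next.src });
    }
  }
}

const tokens = blockPass(['see [docs]', '[docs]: https://example.com']);
inlinePass(); // the reference resolves even though it preceded its definition
console.log(tokens[0].tokens[0].href); // 'https://example.com'
```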