
[api-minor] Reduce postMessage overhead, in PartialEvaluator.getTextContent, by sending text chunks in batches (issue 13962) #13977


Merged
merged 2 commits into mozilla:master from Snuffleupagus:enqueueChunk-batch on Sep 11, 2021

Conversation

Snuffleupagus
Collaborator

@Snuffleupagus Snuffleupagus commented Sep 5, 2021

Following the STR in the issue, this patch reduces the number of PartialEvaluator.getTextContent-related postMessage-calls by approximately 78 percent.[1]
Note that by enforcing a relatively low value when batching text chunks, we should improve worst-case scenarios without negatively affecting textLayer building in general.
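
For illustration, here is a minimal sketch of the batching idea; the constant name, its value and the function signature are assumptions for the sake of the example, not the actual PDF.js code:

```js
// Sketch only: accumulate parsed text items and post them to the main thread
// in batches, instead of issuing one postMessage-call per small chunk.
const TEXT_CHUNK_BATCH_SIZE = 10; // Illustrative minimum batch size.

function enqueueChunk(sink, textContent, batch = false) {
  const length = textContent.items.length;
  if (length === 0) {
    return;
  }
  // While parsing (batch === true), keep accumulating until the minimum
  // batch size is reached; a final call with batch === false flushes the rest.
  if (batch && length < TEXT_CHUNK_BATCH_SIZE) {
    return;
  }
  sink.enqueue(textContent, length); // One postMessage for the whole batch.
  textContent.items = [];
  textContent.styles = Object.create(null);
}
```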

While working on these changes I noticed, thanks to our unit-tests, that the implementation of the appendEOL function unfortunately means that the number and contents of the textItems could actually be affected by the particular chunking used.
That seems extremely unfortunate, since in practice it means that the particular chunking used is observable through the API. Obviously that should be a completely internal implementation detail, which is why this patch also modifies appendEOL to mitigate that.[2]
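
To illustrate the problem, here is a hedged sketch of one way to make the EOL handling independent of chunk boundaries (illustrative only, not the actual appendEOL implementation): remember across flushed batches whether the previously emitted item already ended a line, rather than only inspecting the items still pending in the current batch.

```js
// Sketch only: lastItemHadEOL persists across enqueueChunk-calls, so the
// decision below does not change depending on whether the previous item is
// still in the pending batch or has already been sent to the main thread.
let lastItemHadEOL = false;

function appendEOL(textContentItems) {
  const lastItem = textContentItems[textContentItems.length - 1];
  if (lastItem) {
    lastItem.hasEOL = true; // Mark the still-pending item directly.
  } else if (!lastItemHadEOL) {
    // The previous batch was already flushed; only emit a standalone EOL
    // item if the last flushed item did not already end a line.
    textContentItems.push({ str: "", hasEOL: true });
  }
  lastItemHadEOL = true;
}
```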

Given that this patch adds a minimum batch size in enqueueChunk, there's obviously nothing preventing a batch from becoming a lot larger than the limit (depending e.g. on the PDF structure and the CPU load/speed).
While sending more text chunks at once isn't an issue in itself, it could become problematic on the main thread during textLayer building. Note how both the PartialEvaluator and CanvasGraphics implementations utilize Date.now()-checks to prevent long-running parsing/rendering from "hanging" the respective thread. In the textLayer building we don't utilize such a construction[3], and streaming of textContent is thus essentially acting as a simple stand-in for that functionality.
Hence we want to avoid choosing too large a minimum batch size, since that could indirectly affect main-thread performance negatively.
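
For context, a rough sketch of the Date.now()-based time-slicing pattern referred to above (illustrative only; the constant and callback names are assumptions, not the exact PartialEvaluator/CanvasGraphics code):

```js
// Sketch only: stop after a small time budget and reschedule the remaining
// work, so that long-running parsing/rendering cannot "hang" the thread.
const EXECUTION_TIME = 15; // ms; illustrative time budget.

function executeOperations(operations, index, doneCallback) {
  const startTime = Date.now();
  for (let i = index; i < operations.length; i++) {
    operations[i]();
    if (Date.now() - startTime > EXECUTION_TIME) {
      // Yield to the event loop, then continue where we left off.
      setTimeout(() => executeOperations(operations, i + 1, doneCallback), 0);
      return;
    }
  }
  doneCallback();
}
```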

Improves and/or fixes #13962


[1] While it'd be possible to go even lower, that'd likely require more invasive re-factoring/changes to the PartialEvaluator.getTextContent-code to ensure that the batches don't become too large.

[2] This should also, as far as I can tell, explain some of the regressions observed in the "enhance" text-selection tests back in PR #13257.
Looking closer at the appendEOL function, it should potentially be changed even more; however, that should probably not be done here.

[3] I'd really like to avoid implementing something like that for the textLayer building as well, given that it'd require adding a fair bit of complexity.

@mozilla mozilla deleted a comment from pdfjsbot Sep 7, 2021
@mozilla mozilla deleted a comment from pdfjsbot Sep 7, 2021
@pdfjsbot

pdfjsbot commented Sep 7, 2021

From: Bot.io (Linux m4)


Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/3499b2436b2e25f/output.txt

@pdfjsbot

pdfjsbot commented Sep 7, 2021

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/3499b2436b2e25f/output.txt

Total script time: 4.44 mins

Published

…xtContent`, by sending text chunks in batches (issue 13962)

…xtContent"-handler

The `MessageHandler`-implementation already handles either of these callbacks being undefined, hence there's no particular reason (as far as I can tell) to add no-op functions here.

Also, in a couple of `MessageHandler`-methods, utilize an already existing local variable more.
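
As a hedged illustration of why the no-op functions are unnecessary (the sink shape below is an assumption for the example, not the actual MessageHandler code): the consumer simply checks whether a callback exists before invoking it.

```js
// Sketch only: onPull/onCancel may simply be left undefined on the sink,
// since the consuming side guards the calls before invoking them.
const sink = {
  onPull: undefined,
  onCancel: undefined,
  enqueue(chunk, size) {
    // ... post the chunk to the other thread ...
  },
};

// E.g. when the other side asks for more data:
if (sink.onPull) {
  sink.onPull();
}
```
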
@Snuffleupagus
Collaborator Author

/botio test

@pdfjsbot

pdfjsbot commented Sep 9, 2021

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/30ce54cd4242922/output.txt

@pdfjsbot

pdfjsbot commented Sep 9, 2021

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/9b06e67d4ef85b6/output.txt

@pdfjsbot

pdfjsbot commented Sep 9, 2021

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/30ce54cd4242922/output.txt

Total script time: 22.01 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 13

Image differences available at: http://54.241.84.105:8877/30ce54cd4242922/reftest-analyzer.html#web=eq.log

@pdfjsbot

pdfjsbot commented Sep 9, 2021

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/9b06e67d4ef85b6/output.txt

Total script time: 39.49 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 10
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/9b06e67d4ef85b6/reftest-analyzer.html#web=eq.log

@timvandermeij
Contributor

/botio-linux preview

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/a38a03a3af1a951/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/a38a03a3af1a951/output.txt

Total script time: 4.67 mins

Published

@timvandermeij timvandermeij merged commit e97f01b into mozilla:master Sep 11, 2021
@timvandermeij
Contributor

Nice work!

@Snuffleupagus Snuffleupagus deleted the enqueueChunk-batch branch September 11, 2021 11:55
@Snuffleupagus
Collaborator Author

Snuffleupagus commented Sep 11, 2021

@timvandermeij As always, thanks a lot for the reviews :-)


Off-topic: How do you feel about a new release soon, since there have been a fair number of api-minor changes recently and also new features such as the just-landed /Interpolate-support (fixing many old issues)?


Successfully merging this pull request may close these issues.

Slow text extraction in large documents due to cross-thread messaging overhead