
[api-minor] Reduce postMessage overhead, in PartialEvaluator.getTextContent, by sending text chunks in batches (issue 13962) #13977


Merged
merged 2 commits into mozilla:master from Snuffleupagus:enqueueChunk-batch on Sep 11, 2021

Conversation

Snuffleupagus
Collaborator

@Snuffleupagus Snuffleupagus commented Sep 5, 2021

Following the STR in the issue, this patch reduces the number of PartialEvaluator.getTextContent-related postMessage-calls by approximately 78 percent.[1]
Note that by enforcing a relatively low value when batching text chunks, we should improve worst-case scenarios without negatively affecting textLayer building in general.
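
For illustration, here is a minimal sketch of the batching idea; the constant name, its value and the function signature are assumptions for the sake of the example, not the actual PDF.js code:

```js
// Sketch only: accumulate parsed text items and post them to the main thread
// in batches, instead of issuing one postMessage-call per small chunk.
const TEXT_CHUNK_BATCH_SIZE = 10; // Illustrative minimum batch size.

function enqueueChunk(sink, textContent, batch = false) {
  const length = textContent.items.length;
  if (length === 0) {
    return;
  }
  // While parsing (batch === true), keep accumulating until the minimum
  // batch size is reached; a final call with batch === false flushes the rest.
  if (batch && length < TEXT_CHUNK_BATCH_SIZE) {
    return;
  }
  sink.enqueue(textContent, length); // One postMessage for the whole batch.
  textContent.items = [];
  textContent.styles = Object.create(null);
}
```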

While working on these changes I noticed, thanks to our unit-tests, that the implementation of the appendEOL function unfortunately means that the number and contents of the textItems could actually be affected by the particular chunking used.
That seems extremely unfortunate, since in practice it means that the particular chunking used is observable through the API. Obviously that should be a completely internal implementation detail, which is why this patch also modifies appendEOL to mitigate that.[2]
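
To illustrate the problem, here is a hedged sketch of one way to make the EOL handling independent of chunk boundaries (illustrative only, not the actual appendEOL implementation): remember across flushed batches whether the previously emitted item already ended a line, rather than only inspecting the items still pending in the current batch.

```js
// Sketch only: lastItemHadEOL persists across enqueueChunk-calls, so the
// decision below does not change depending on whether the previous item is
// still in the pending batch or has already been sent to the main thread.
let lastItemHadEOL = false;

function appendEOL(textContentItems) {
  const lastItem = textContentItems[textContentItems.length - 1];
  if (lastItem) {
    lastItem.hasEOL = true; // Mark the still-pending item directly.
  } else if (!lastItemHadEOL) {
    // The previous batch was already flushed; only emit a standalone EOL
    // item if the last flushed item did not already end a line.
    textContentItems.push({ str: "", hasEOL: true });
  }
  lastItemHadEOL = true;
}
```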

Given that this patch adds a minimum batch size in enqueueChunk, there's obviously nothing preventing a batch from becoming a lot larger than the limit (depending e.g. on the PDF structure and the CPU load/speed).
While sending more text chunks at once isn't an issue in itself, it could become problematic on the main thread during textLayer building. Note how both the PartialEvaluator and CanvasGraphics implementations utilize Date.now()-checks to prevent long-running parsing/rendering from "hanging" the respective thread. In the textLayer building we don't utilize such a construction[3], and streaming of textContent is thus essentially acting as a simple stand-in for that functionality.
Hence we want to avoid choosing too large a minimum batch size, since that could indirectly affect main-thread performance negatively.
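
For context, a rough sketch of the Date.now()-based time-slicing pattern referred to above (illustrative only; the constant and callback names are assumptions, not the exact PartialEvaluator/CanvasGraphics code):

```js
// Sketch only: stop after a small time budget and reschedule the remaining
// work, so that long-running parsing/rendering cannot "hang" the thread.
const EXECUTION_TIME = 15; // ms; illustrative time budget.

function executeOperations(operations, index, doneCallback) {
  const startTime = Date.now();
  for (let i = index; i < operations.length; i++) {
    operations[i]();
    if (Date.now() - startTime > EXECUTION_TIME) {
      // Yield to the event loop, then continue where we left off.
      setTimeout(() => executeOperations(operations, i + 1, doneCallback), 0);
      return;
    }
  }
  doneCallback();
}
```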

Improves and/or fixes #13962


[1] While it'd be possible to go even lower, that'd likely require more invasive re-factoring/changes to the PartialEvaluator.getTextContent-code to ensure that the batches don't become too large.

[2] This should also, as far as I can tell, explain some of the regressions observed in the "enhance" text-selection tests back in PR #13257.
Looking closer at the appendEOL function, it should potentially be changed even more; however, that should probably not be done here.

[3] I'd really like to avoid implementing something like that for the textLayer building as well, given that it'd require adding a fair bit of complexity.

@mozilla mozilla deleted a comment from pdfjsbot Sep 7, 2021
@mozilla mozilla deleted a comment from pdfjsbot Sep 7, 2021
@pdfjsbot

pdfjsbot commented Sep 7, 2021

From: Bot.io (Linux m4)


Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/3499b2436b2e25f/output.txt

@pdfjsbot

pdfjsbot commented Sep 7, 2021

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/3499b2436b2e25f/output.txt

Total script time: 4.44 mins

Published

…xtContent`, by sending text chunks in batches (issue 13962)

…xtContent"-handler

The `MessageHandler`-implementation already handles either of these callbacks being undefined, hence there's no particular reason (as far as I can tell) to add no-op functions here.

Also, in a couple of `MessageHandler`-methods, utilize an already existing local variable more.
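
As a hedged illustration of why the no-op functions are unnecessary (the sink shape below is an assumption for the example, not the actual MessageHandler code): the consumer simply checks whether a callback exists before invoking it.

```js
// Sketch only: onPull/onCancel may simply be left undefined on the sink,
// since the consuming side guards the calls before invoking them.
const sink = {
  onPull: undefined,
  onCancel: undefined,
  enqueue(chunk, size) {
    // ... post the chunk to the other thread ...
  },
};

// E.g. when the other side asks for more data:
if (sink.onPull) {
  sink.onPull();
}
```
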
@Snuffleupagus
Collaborator Author

/botio test

@pdfjsbot

pdfjsbot commented Sep 9, 2021

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/30ce54cd4242922/output.txt

@pdfjsbot

pdfjsbot commented Sep 9, 2021

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/9b06e67d4ef85b6/output.txt

@pdfjsbot

pdfjsbot commented Sep 9, 2021

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/30ce54cd4242922/output.txt

Total script time: 22.01 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 13

Image differences available at: http://54.241.84.105:8877/30ce54cd4242922/reftest-analyzer.html#web=eq.log

@pdfjsbot

pdfjsbot commented Sep 9, 2021

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/9b06e67d4ef85b6/output.txt

Total script time: 39.49 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 10
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/9b06e67d4ef85b6/reftest-analyzer.html#web=eq.log

@timvandermeij
Contributor

/botio-linux preview

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/a38a03a3af1a951/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/a38a03a3af1a951/output.txt

Total script time: 4.67 mins

Published

@timvandermeij timvandermeij merged commit e97f01b into mozilla:master Sep 11, 2021
@timvandermeij
Contributor

Nice work!

@Snuffleupagus Snuffleupagus deleted the enqueueChunk-batch branch September 11, 2021 11:55
@Snuffleupagus
Collaborator Author

Snuffleupagus commented Sep 11, 2021

@timvandermeij As always, thanks a lot for the reviews :-)


Off-topic: How do you feel about a new release soon, since there have been a fair number of api-minor changes recently and also new features such as the just-landed /Interpolate-support (fixing many old issues)?


Successfully merging this pull request may close these issues.

Slow text extraction in large documents due to cross-thread messaging overhead