Avoid overloading the worker-thread during eager page initialization in the viewer (PR 11263 follow-up) #14359


Merged
merged 1 commit into mozilla:master from Snuffleupagus:PAUSE_EAGER_PAGE_INIT on Dec 11, 2021

Conversation

@Snuffleupagus (Collaborator) commented Dec 10, 2021

This patch is essentially another continuation of PR #11263, which tried to improve loading/initialization performance of very large/long documents.

For most documents, unless they're very long, we'll eagerly initialize all of the pages in the viewer. For shorter documents, having all pages loaded/initialized early provides better overall performance/UX in the viewer; however, there are cases where it can instead hurt performance.
For documents with a couple of thousand pages[1], the parsing and pre-rendering of the second page of the document can be delayed quite a bit. The reason is that we trigger PDFDocumentProxy.getPage for all pages early during the viewer initialization, which swamps the worker-thread with handling (potentially) thousands of getPage-calls and leaves very little time for other parsing (such as e.g. of operatorLists).

To address this situation, this patch thus proposes temporarily "pausing" the eager PDFDocumentProxy.getPage-calls once a threshold has been reached, to give the worker-thread a chance to handle other requests.[2]
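
To make the idea concrete, here's a minimal sketch in plain JavaScript of what "pausing" after a threshold could look like, written against the public PDFDocumentProxy interface. The threshold value (250), the function name, and the page-view initialization step are all illustrative assumptions, not the actual code from this patch:

```js
// Hypothetical sketch, not the actual pdf.js implementation.
// Issue the eager `getPage`-calls in batches, and wait for each batch
// to settle before continuing, so the worker-thread gets a chance to
// service other requests (e.g. operatorList parsing) in between.
const PAUSE_EAGER_PAGE_INIT = 250; // assumed threshold

async function eagerlyInitializePages(pdfDocument) {
  const promises = [];

  for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
    const promise = pdfDocument.getPage(pageNum).then((pdfPage) => {
      // ...create and initialize the page view here...
    });
    promises.push(promise);

    if (pageNum % PAUSE_EAGER_PAGE_INIT === 0) {
      // "Pause" until the current batch of `getPage`-calls has been
      // handled, rather than queuing thousands of them up-front.
      await Promise.all(promises);
    }
  }
  await Promise.all(promises);
}
```

The point is that the worker-thread's message queue never holds more than one batch of pending getPage-requests at a time, so e.g. the operatorList request needed to render the second page isn't stuck behind thousands of queued calls.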

Obviously this may slightly delay the "pagesloaded" event in longer documents, but considering that it's already the result of asynchronous parsing, this will hopefully not be seen as a blocker for these changes.[3]


[1] A particularly problematic example is https://github.com/mozilla/pdf.js/files/876321/kjv.pdf (16 MB), a document with 2236 pages and a /Pages-tree that's only one level deep.

[2] Please note that I initially considered simply chaining the PDFDocumentProxy.getPage-calls (see the sketch below), however that would have slowed things down for all documents, which didn't seem appropriate.
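
For comparison, the "chaining" alternative mentioned in note [2] would look roughly as follows (again just a hypothetical sketch); since each getPage-call only starts after the previous one resolves, page initialization becomes fully sequential for every document, short or long, which is what made it unattractive:

```js
// Hypothetical sketch of the rejected "chaining" approach: strictly
// one in-flight `getPage`-call at a time, for *all* documents.
async function chainedInitializePages(pdfDocument) {
  for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
    // The next `getPage`-call isn't issued until the previous page
    // has been fetched.
    const pdfPage = await pdfDocument.getPage(pageNum);
    // ...create and initialize the page view for `pdfPage` here...
  }
}
```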

[3] This patch will hopefully also make it possible to re-visit PR #11312, since it seems that changing Catalog.getPageDict to an async method wasn't the problem in itself. Rather, it appears to lead to slightly different timings, thus exacerbating the already existing issues with the worker-thread being overloaded by getPage-calls.
Having recently worked with that method, there are a couple of (very old) issues that I'd also like to address, and having Catalog.getPageDict be async would simplify things a great deal.

@Snuffleupagus (Collaborator, Author)

/botio integrationtest

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/3e2c77f0b7067e6/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/1e31515c509e783/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/1e31515c509e783/output.txt

Total script time: 2.49 mins

  • Integration Tests: FAILED

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/3e2c77f0b7067e6/output.txt

Total script time: 5.42 mins

  • Integration Tests: Passed

@timvandermeij (Contributor)

/botio-linux preview

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/d23d305efb4aef5/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/d23d305efb4aef5/output.txt

Total script time: 4.24 mins

Published

@timvandermeij timvandermeij merged commit 3a8318a into mozilla:master Dec 11, 2021
@timvandermeij (Contributor)

Let's do this; thanks!

@Snuffleupagus Snuffleupagus deleted the PAUSE_EAGER_PAGE_INIT branch December 11, 2021 15:02