Avoid overloading the worker-thread during eager page initialization in the viewer (PR 11263 follow-up) #14359


Merged
merged 1 commit into mozilla:master from Snuffleupagus:PAUSE_EAGER_PAGE_INIT on Dec 11, 2021

Conversation

@Snuffleupagus (Collaborator) commented Dec 10, 2021

This patch is essentially another continuation of PR #11263, which tried to improve loading/initialization performance of very large/long documents.

For most documents, unless they're very long, we'll eagerly initialize all of the pages in the viewer. For shorter documents, having all pages loaded/initialized early provides better overall performance/UX in the viewer; however, there are cases where it can instead hurt performance.
For documents with a couple of thousand pages[1], the parsing and pre-rendering of the second page of the document can be delayed quite a bit. The reason is that we trigger PDFDocumentProxy.getPage for all pages early during the viewer initialization, which swamps the worker-thread with handling (potentially) thousands of getPage-calls and leaves very little time for other parsing (such as e.g. of operatorLists).

To address this situation, this patch thus proposes temporarily "pausing" the eager PDFDocumentProxy.getPage-calls once a threshold has been reached, to give the worker-thread a chance to handle other requests.[2]
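
To make the idea concrete, here's a minimal sketch in plain JavaScript of what "pausing" after a threshold could look like, written against the public PDFDocumentProxy interface. The threshold value (250), the function name, and the page-view initialization step are all illustrative assumptions, not the actual code from this patch:

```js
// Hypothetical sketch, not the actual pdf.js implementation.
// Issue the eager `getPage`-calls in batches, and wait for each batch
// to settle before continuing, so the worker-thread gets a chance to
// service other requests (e.g. operatorList parsing) in between.
const PAUSE_EAGER_PAGE_INIT = 250; // assumed threshold

async function eagerlyInitializePages(pdfDocument) {
  const promises = [];

  for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
    const promise = pdfDocument.getPage(pageNum).then((pdfPage) => {
      // ...create and initialize the page view here...
    });
    promises.push(promise);

    if (pageNum % PAUSE_EAGER_PAGE_INIT === 0) {
      // "Pause" until the current batch of `getPage`-calls has been
      // handled, rather than queuing thousands of them up-front.
      await Promise.all(promises);
    }
  }
  await Promise.all(promises);
}
```

The point is that the worker-thread's message queue never holds more than one batch of pending getPage-requests at a time, so e.g. the operatorList request needed to render the second page isn't stuck behind thousands of queued calls.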

Obviously this may slightly delay the "pagesloaded" event in longer documents, but considering that it's already the result of asynchronous parsing, this will hopefully not be seen as a blocker for these changes.[3]


[1] A particularly problematic example is https://github.com/mozilla/pdf.js/files/876321/kjv.pdf (16 MB), a document with 2236 pages and a /Pages-tree that's only one level deep.

[2] Please note that I initially considered simply chaining the PDFDocumentProxy.getPage-calls (see the sketch below), however that would have slowed things down for all documents, which didn't seem appropriate.
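
For comparison, the "chaining" alternative mentioned in note [2] would look roughly as follows (again just a hypothetical sketch); since each getPage-call only starts after the previous one resolves, page initialization becomes fully sequential for every document, short or long, which is what made it unattractive:

```js
// Hypothetical sketch of the rejected "chaining" approach: strictly
// one in-flight `getPage`-call at a time, for *all* documents.
async function chainedInitializePages(pdfDocument) {
  for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
    // The next `getPage`-call isn't issued until the previous page
    // has been fetched.
    const pdfPage = await pdfDocument.getPage(pageNum);
    // ...create and initialize the page view for `pdfPage` here...
  }
}
```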

[3] This patch will hopefully also make it possible to re-visit PR #11312, since it seems that changing Catalog.getPageDict to an async method wasn't the problem in itself. Rather, it appears to lead to slightly different timings, thus exacerbating the already existing issues with the worker-thread being overloaded by getPage-calls.
Having recently worked with that method, there are a couple of (very old) issues that I'd also like to address, and having Catalog.getPageDict be async would simplify things a great deal.

@Snuffleupagus (Collaborator, Author)

/botio integrationtest

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/3e2c77f0b7067e6/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/1e31515c509e783/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/1e31515c509e783/output.txt

Total script time: 2.49 mins

  • Integration Tests: FAILED

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/3e2c77f0b7067e6/output.txt

Total script time: 5.42 mins

  • Integration Tests: Passed

@timvandermeij (Contributor)

/botio-linux preview

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/d23d305efb4aef5/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/d23d305efb4aef5/output.txt

Total script time: 4.24 mins

Published

@timvandermeij timvandermeij merged commit 3a8318a into mozilla:master Dec 11, 2021
@timvandermeij (Contributor)

Let's do this; thanks!

@Snuffleupagus Snuffleupagus deleted the PAUSE_EAGER_PAGE_INIT branch December 11, 2021 15:02