-
Notifications
You must be signed in to change notification settings - Fork 10.3k
[api-minor] Replace PDFDocumentProxy.getStats
with a synchronous PDFDocumentProxy.stats
getter
#14294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…DFDocumentProxy.stats` getter *Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents. The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page. This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered. Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1] This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes. Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object. This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents). --- [1] Furthermore, the maximium number of `StreamType`/`FontType` are `10` respectively `12`, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see https://github.com/mozilla/pdf.js/blob/41ac3f0c07128bf34baccdcc067a108c712fd6ef/src/shared/util.js#L206-L232 [2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread. [3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code. In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
/botio test |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.241.84.105:8877/58c84ed279390fa/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.193.163.58:8877/fa5458fcec2cd73/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.241.84.105:8877/58c84ed279390fa/output.txt Total script time: 21.82 mins
Image differences available at: http://54.241.84.105:8877/58c84ed279390fa/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://54.193.163.58:8877/fa5458fcec2cd73/output.txt Total script time: 41.98 mins
Image differences available at: http://54.193.163.58:8877/fa5458fcec2cd73/reftest-analyzer.html#web=eq.log |
Looks really good; thank you for doing this! |
Please note: These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.
The existing
PDFDocumentProxy.getStats
function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for every rendered page.This patch proposes replacing that method with a synchronous
PDFDocumentProxy.stats
getter instead, combined with re-factoring the worker-thread code by adding aDocStats
-class to track Stream/Font-types and only send them to the main-thread the first time that a type is encountered.Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the
PDFDocumentProxy.getStats
-calls will return the same data.[1]This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.
Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated
PDFDocumentProxy.stats
-calls will return the same identical object.This is something that we can easily take advantage of in the default viewer, by now only reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).
[1] Furthermore, the maximium number of
StreamType
/FontType
are10
respectively12
, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; seepdf.js/src/shared/util.js
Lines 206 to 232 in 41ac3f0
[2] One example is the
pdf.pdf
document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread.[3] Reporting telemetry, in Firefox, includes using
JSON.stringify
on the data and then sending an event to thePdfStreamConverter.jsm
-code.In that code the event is handled and
JSON.parse
is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549