Skip to content

[api-minor] Replace PDFDocumentProxy.getStats with a synchronous PDFDocumentProxy.stats getter #14294

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 20, 2021

Conversation

Snuffleupagus
Copy link
Collaborator

Please note: These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.

The existing PDFDocumentProxy.getStats function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for every rendered page.
This patch proposes replacing that method with a synchronous PDFDocumentProxy.stats getter instead, combined with re-factoring the worker-thread code by adding a DocStats-class to track Stream/Font-types and only send them to the main-thread the first time that a type is encountered.

Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the PDFDocumentProxy.getStats-calls will return the same data.[1]
This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.

Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated PDFDocumentProxy.stats-calls will return the same identical object.
This is something that we can easily take advantage of in the default viewer, by now only reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).


[1] Furthermore, the maximium number of StreamType/FontType are 10 respectively 12, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see

pdf.js/src/shared/util.js

Lines 206 to 232 in 41ac3f0

const StreamType = {
UNKNOWN: "UNKNOWN",
FLATE: "FLATE",
LZW: "LZW",
DCT: "DCT",
JPX: "JPX",
JBIG: "JBIG",
A85: "A85",
AHX: "AHX",
CCF: "CCF",
RLX: "RLX", // PDF short name is 'RL', but telemetry requires three chars.
};
const FontType = {
UNKNOWN: "UNKNOWN",
TYPE1: "TYPE1",
TYPE1STANDARD: "TYPE1STANDARD",
TYPE1C: "TYPE1C",
CIDFONTTYPE0: "CIDFONTTYPE0",
CIDFONTTYPE0C: "CIDFONTTYPE0C",
TRUETYPE: "TRUETYPE",
CIDFONTTYPE2: "CIDFONTTYPE2",
TYPE3: "TYPE3",
OPENTYPE: "OPENTYPE",
TYPE0: "TYPE0",
MMTYPE1: "MMTYPE1",
};

[2] One example is the pdf.pdf document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread.

[3] Reporting telemetry, in Firefox, includes using JSON.stringify on the data and then sending an event to the PdfStreamConverter.jsm-code.
In that code the event is handled and JSON.parse is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549

…DFDocumentProxy.stats` getter

*Please note:* These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.

The existing `PDFDocumentProxy.getStats` function, which in the default viewer is called for each rendered page, requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for *every rendered* page.
This patch proposes replacing that method with a *synchronous* `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class to track Stream/Font-types and *only send* them to the main-thread *the first time* that a type is encountered.

Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer document the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), note that even a one-page document would start to benefit from these changes.

Another benefit of having the data available/cached in the API is that unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return *the same identical* object.
This is something that we can easily take advantage of in the default viewer, by now *only* reporting "documentStats" telemetry[3] when the data actually have changed rather than once per rendered page (again beneficial in longer documents).

---
[1] Furthermore, the maximium number of `StreamType`/`FontType` are `10` respectively `12`, which means that regardless of the complexity and page count in a PDF document there'll never be more than twenty-two "DocStats" messages sent; see https://github.com/mozilla/pdf.js/blob/41ac3f0c07128bf34baccdcc067a108c712fd6ef/src/shared/util.js#L206-L232

[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only result in a total of seven "DocStats" messages being sent from the worker-thread.

[3] Reporting telemetry, in Firefox, includes using `JSON.stringify` on the data and then sending an event to the `PdfStreamConverter.jsm`-code.
In that code the event is handled and `JSON.parse` is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/58c84ed279390fa/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/fa5458fcec2cd73/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/58c84ed279390fa/output.txt

Total script time: 21.82 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 9
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/58c84ed279390fa/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/fa5458fcec2cd73/output.txt

Total script time: 41.98 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 13
  different first/second rendering: 2

Image differences available at: http://54.193.163.58:8877/fa5458fcec2cd73/reftest-analyzer.html#web=eq.log

@timvandermeij timvandermeij merged commit aabd4e5 into mozilla:master Nov 20, 2021
@timvandermeij
Copy link
Contributor

Looks really good; thank you for doing this!

@Snuffleupagus Snuffleupagus deleted the getStats-refactor branch November 20, 2021 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants