[api-minor] Replace `PDFDocumentProxy.getStats` with a synchronous `PDFDocumentProxy.stats` getter #14294

Snuffleupagus · 2021-11-20T11:24:48Z

Please note: These changes will primarily benefit longer documents, somewhat at the expense of e.g. one-page documents.

The existing `PDFDocumentProxy.getStats` function requires a round-trip to the worker-thread in order to obtain the current document stats. In the default viewer, we currently make one such API-call for every rendered page.
This patch proposes replacing that method with a synchronous `PDFDocumentProxy.stats` getter instead, combined with re-factoring the worker-thread code by adding a `DocStats`-class that tracks Stream/Font-types and only sends them to the main-thread the first time that a type is encountered.
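
To illustrate the "only send the first time" idea, here is a minimal sketch. It is not the code from this patch: the class name `DocStatsSketch`, the `send` callback (standing in for the worker-to-main-thread messaging), and the payload shape are all assumptions made for the example.

```javascript
// Hypothetical sketch; names and payload shape are assumptions, not the patch itself.
class DocStatsSketch {
  constructor(send) {
    this._send = send; // assumed callback that delivers a message to the main-thread
    this._streamTypes = new Set();
    this._fontTypes = new Set();
  }

  addStreamType(type) {
    this._add(this._streamTypes, type);
  }

  addFontType(type) {
    this._add(this._fontTypes, type);
  }

  _add(set, type) {
    if (set.has(type)) {
      return; // Already reported once, so no new message is needed.
    }
    set.add(type);
    // Send a snapshot of everything seen so far.
    this._send("DocStats", {
      streamTypes: [...this._streamTypes],
      fontTypes: [...this._fontTypes],
    });
  }
}

// Simulate parsing many pages that all use the same types.
const messages = [];
const docStats = new DocStatsSketch((name, data) => messages.push({ name, data }));
for (let page = 0; page < 1000; page++) {
  docStats.addStreamType("FLATE");
  docStats.addFontType("TRUETYPE");
}
docStats.addStreamType("DCT");
console.log(messages.length); // 3
```

With this approach the number of messages depends on the number of distinct types, not on the number of pages.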

Note that in practice most PDF documents only use a fairly limited number of Stream/Font-types, which means that in longer documents most of the `PDFDocumentProxy.getStats`-calls will return the same data.[1]
This re-factoring will obviously benefit longer documents the most[2], and could actually be seen as a regression for one-page documents, since in practice there'll usually be a couple of "DocStats" messages sent during the parsing of the first page. However, if the user zooms/rotates the document (which causes re-rendering), even a one-page document would start to benefit from these changes.

Another benefit of having the data available/cached in the API is that, unless the document stats change during parsing, repeated `PDFDocumentProxy.stats`-calls will return the identical object.
This is something that we can easily take advantage of in the default viewer, by now only reporting "documentStats" telemetry[3] when the data has actually changed rather than once per rendered page (again beneficial in longer documents).
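
The identity check can be sketched as follows. This is a hedged illustration rather than the viewer code: `createStatsReporter`, the `report` callback, and the plain `fakeDocument` object (standing in for a `PDFDocumentProxy`) are hypothetical, and the shape of the stats data is assumed.

```javascript
// Hypothetical sketch; helper names and the stats shape are assumptions.
function createStatsReporter(report) {
  let lastStats = null;
  return function onPageRendered(pdfDocument) {
    const stats = pdfDocument.stats; // synchronous getter, no worker round-trip
    if (!stats || stats === lastStats) {
      return; // Nothing available yet, or the same object as last time.
    }
    lastStats = stats;
    report(stats);
  };
}

const reports = [];
const onPageRendered = createStatsReporter(stats => reports.push(stats));

// Stand-in for a document whose cached stats object changes only twice.
const fakeDocument = { stats: null };
onPageRendered(fakeDocument); // no stats yet, nothing reported

fakeDocument.stats = { streamTypes: { FLATE: true }, fontTypes: {} };
for (let i = 0; i < 5; i++) {
  onPageRendered(fakeDocument); // reported once, then skipped four times
}

fakeDocument.stats = { streamTypes: { FLATE: true }, fontTypes: { TRUETYPE: true } };
onPageRendered(fakeDocument);
console.log(reports.length); // 2
```

A reference comparison is enough here only because the API hands back the same cached object until new data arrives.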

[1] Furthermore, the maximum number of `StreamType`/`FontType` values is 10 and 12 respectively, which means that regardless of the complexity and page count of a PDF document there'll never be more than twenty-two "DocStats" messages sent; see

pdf.js/src/shared/util.js

Lines 206 to 232 in 41ac3f0

    
           const StreamType = { 
        
             UNKNOWN: "UNKNOWN", 
        
             FLATE: "FLATE", 
        
             LZW: "LZW", 
        
             DCT: "DCT", 
        
             JPX: "JPX", 
        
             JBIG: "JBIG", 
        
             A85: "A85", 
        
             AHX: "AHX", 
        
             CCF: "CCF", 
        
             RLX: "RLX", // PDF short name is 'RL', but telemetry requires three chars. 
        
           }; 
        
           const FontType = { 
        
             UNKNOWN: "UNKNOWN", 
        
             TYPE1: "TYPE1", 
        
             TYPE1STANDARD: "TYPE1STANDARD", 
        
             TYPE1C: "TYPE1C", 
        
             CIDFONTTYPE0: "CIDFONTTYPE0", 
        
             CIDFONTTYPE0C: "CIDFONTTYPE0C", 
        
             TRUETYPE: "TRUETYPE", 
        
             CIDFONTTYPE2: "CIDFONTTYPE2", 
        
             TYPE3: "TYPE3", 
        
             OPENTYPE: "OPENTYPE", 
        
             TYPE0: "TYPE0", 
        
             MMTYPE1: "MMTYPE1", 
        
           };

[2] One example is the `pdf.pdf` document in the test-suite, where rendering all of its 1310 pages only results in a total of seven "DocStats" messages being sent from the worker-thread.

[3] Reporting telemetry, in Firefox, includes using JSON.stringify on the data and then sending an event to the PdfStreamConverter.jsm-code.
In that code the event is handled and JSON.parse is used to retrieve the data, and in the "documentStats"-case we'll then iterate through the data to avoid double-reporting telemetry; see https://searchfox.org/mozilla-central/rev/8f4c180b87e52f3345ef8a3432d6e54bd1eb18dc/toolkit/components/pdfjs/content/PdfStreamConverter.jsm#515-549
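
To make the per-report cost concrete, the round-trip described in [3] can be sketched roughly like this. It is a simplified illustration and not the `PdfStreamConverter.jsm` code: `makeTelemetryReceiver`, the `record` callback, and the payload shape are hypothetical.

```javascript
// Hypothetical sketch of the receiving side; not the actual Firefox code.
function makeTelemetryReceiver(record) {
  // Types that have already been recorded, kept to avoid double-reporting.
  const seen = { streamTypes: new Set(), fontTypes: new Set() };
  return function onEvent(payload) {
    const data = JSON.parse(payload); // every report pays for a parse
    for (const kind of ["streamTypes", "fontTypes"]) {
      for (const type of Object.keys(data[kind] || {})) {
        if (seen[kind].has(type)) {
          continue; // already recorded
        }
        seen[kind].add(type);
        record(kind, type);
      }
    }
  };
}

const recorded = [];
const onEvent = makeTelemetryReceiver((kind, type) => recorded.push(`${kind}:${type}`));

// The sending side serializes the data before dispatching the event.
onEvent(JSON.stringify({ streamTypes: { FLATE: true }, fontTypes: { TRUETYPE: true } }));
onEvent(JSON.stringify({ streamTypes: { FLATE: true, DCT: true }, fontTypes: { TRUETYPE: true } }));
console.log(recorded); // [ 'streamTypes:FLATE', 'fontTypes:TRUETYPE', 'streamTypes:DCT' ]
```

Each report costs a serialize, a parse, and an iteration even when nothing new is found, which is why sending fewer reports helps.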


Snuffleupagus · 2021-11-20T11:29:48Z

/botio test

pdfjsbot · 2021-11-20T11:29:49Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/58c84ed279390fa/output.txt

pdfjsbot · 2021-11-20T11:29:49Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/fa5458fcec2cd73/output.txt

pdfjsbot · 2021-11-20T11:51:39Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/58c84ed279390fa/output.txt

Total script time: 21.82 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 9
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/58c84ed279390fa/reftest-analyzer.html#web=eq.log

pdfjsbot · 2021-11-20T12:11:49Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/fa5458fcec2cd73/output.txt

Total script time: 41.98 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 13
  different first/second rendering: 2

Image differences available at: http://54.193.163.58:8877/fa5458fcec2cd73/reftest-analyzer.html#web=eq.log

timvandermeij · 2021-11-20T14:42:56Z

Looks really good; thank you for doing this!

Snuffleupagus added core viewer labels Nov 20, 2021

timvandermeij approved these changes Nov 20, 2021

timvandermeij merged commit aabd4e5 into mozilla:master Nov 20, 2021

Snuffleupagus deleted the getStats-refactor branch November 20, 2021 15:00
