Convert `Catalog.getAllPageDicts` to an `async` method #14411

Snuffleupagus · 2021-12-31T15:06:27Z

The patch in PR #14335 essentially re-introduced the old code from before PR #3848; however, looking at this code a bit closer, it should be possible to simplify it by making the method asynchronous.

While this method is currently only used as a fallback in corrupt documents, the way that `MissingDataException`s are handled is less than ideal. Note that if a `MissingDataException` is thrown, we're forced to re-parse the entire /Pages tree[1].
With this method now being asynchronous, we're able to handle fetching of References in a much simpler way than before, without having to throw `MissingDataException`s and re-parse anything.
These changes also let us simplify the call-site slightly, by calling the method directly instead of going through the `PDFManager`-instance (since, again, it will no longer throw `MissingDataException`s).

Furthermore, this patch contains the following other changes:

- Reduce unnecessary duplication in the various `catch` handlers throughout the method, by simply moving the `XRefEntryException` handling into the `addPageError` helper function instead.
- Move the "circular references"-check to occur slightly earlier, since there's obviously no point in asynchronously fetching data just to then throw an Error immediately afterwards.
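
The two changes listed above can be illustrated with a minimal sketch. This is not the actual pdf.js implementation: `FakeXRef`, its `fetchAsync` method, the node shape (`type`, `kids`), and the choice to rethrow `XRefEntryException` outside of a recovery mode are all assumptions made for illustration. Only the names `addPageError` and `XRefEntryException` come from the description.

```javascript
// Hypothetical stand-in for the real exception class.
class XRefEntryException extends Error {}

// Toy cross-reference table: keys are reference ids, values are tree nodes.
class FakeXRef {
  constructor(entries) {
    this.entries = entries;
  }

  // Resolves asynchronously, the way fetching not-yet-loaded data would.
  async fetchAsync(ref) {
    if (!(ref in this.entries)) {
      throw new XRefEntryException(`Missing entry: ${ref}`);
    }
    return this.entries[ref];
  }
}

async function getAllPageDicts(xref, rootRef, recoveryMode = false) {
  const stack = [rootRef];
  const visited = new Set();
  const pages = new Map();
  let pageIndex = 0;

  // Single place for error handling (assumed policy): rethrow
  // XRefEntryException unless recovering, otherwise record the error
  // in the slot the page would have occupied.
  function addPageError(error) {
    if (error instanceof XRefEntryException && !recoveryMode) {
      throw error;
    }
    pages.set(pageIndex++, error);
  }

  while (stack.length > 0) {
    const ref = stack.pop();

    // Check for cycles *before* awaiting any data.
    if (visited.has(ref)) {
      addPageError(new Error("Pages tree contains circular reference."));
      continue;
    }
    visited.add(ref);

    let node;
    try {
      node = await xref.fetchAsync(ref);
    } catch (error) {
      addPageError(error);
      continue;
    }

    if (node.type === "Page") {
      pages.set(pageIndex++, node);
    } else {
      // Push kids in reverse so they are popped in document order.
      stack.push(...[...node.kids].reverse());
    }
  }
  return pages;
}
```

Because the cycle check runs before `await xref.fetchAsync(ref)`, a reference that was already visited never triggers a fetch, and every failure path goes through the one helper instead of repeating the same `catch` logic.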

[1] Imagine, e.g., a thousand-page document where a `MissingDataException` is thrown when fetching/parsing page 900.
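
To put rough numbers on the footnote's scenario, here is a toy model that counts how many pages get visited under each strategy. Everything in it is hypothetical: the flat list of pages, `makeLoader`, `walkWithRestart`, and `walkAsync` are illustrative names, and the `MissingDataException` class is a stand-in rather than the real one.

```javascript
// Hypothetical stand-in: carries the index of the page whose data is missing.
class MissingDataException extends Error {
  constructor(index) {
    super(`Missing data for page index ${index}`);
    this.index = index;
  }
}

// Models a document where exactly one page is not loaded yet.
function makeLoader(missingIndex) {
  const loaded = new Set();
  return {
    // Synchronous access: throws if the data is not available yet.
    getSync(i) {
      if (i === missingIndex && !loaded.has(i)) {
        throw new MissingDataException(i);
      }
      return { page: i };
    },
    // Simulates requesting the missing data.
    async load(i) {
      loaded.add(i);
    },
    // Asynchronous access: loads on demand, then returns the data.
    async getAsync(i) {
      await this.load(i);
      return { page: i };
    },
  };
}

// Throw-and-restart: on missing data, load it and start over from page 0.
async function walkWithRestart(loader, pageCount) {
  let visits = 0;
  for (;;) {
    try {
      for (let i = 0; i < pageCount; i++) {
        visits++;
        loader.getSync(i);
      }
      return visits;
    } catch (error) {
      if (!(error instanceof MissingDataException)) {
        throw error;
      }
      await loader.load(error.index);
    }
  }
}

// Async walk: wait for the data in place and continue.
async function walkAsync(loader, pageCount) {
  let visits = 0;
  for (let i = 0; i < pageCount; i++) {
    visits++;
    await loader.getAsync(i);
  }
  return visits;
}

// Page 900 has index 899.
walkWithRestart(makeLoader(899), 1000).then(n => console.log("restart:", n)); // 1900
walkAsync(makeLoader(899), 1000).then(n => console.log("async:", n)); // 1000
```

In this model the restart strategy visits 900 pages before the exception and then all 1000 again, while the asynchronous walk visits each page once.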


pdfjsbot · 2021-12-31T21:13:28Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/7119004af3fdacc/output.txt

pdfjsbot · 2021-12-31T21:13:28Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/03ce5319099314d/output.txt

pdfjsbot · 2021-12-31T21:35:50Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/7119004af3fdacc/output.txt

Total script time: 22.35 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 12
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/7119004af3fdacc/reftest-analyzer.html#web=eq.log

pdfjsbot · 2021-12-31T21:43:38Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/03ce5319099314d/output.txt

Total script time: 30.14 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  errors: 649
  different ref/snapshot: 8
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/03ce5319099314d/reftest-analyzer.html#web=eq.log

Snuffleupagus · 2021-12-31T21:48:58Z

/botio-windows test

pdfjsbot · 2021-12-31T21:48:59Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/11532ae0b0fcb47/output.txt

pdfjsbot · 2021-12-31T22:31:28Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/11532ae0b0fcb47/output.txt

Total script time: 42.47 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 7
  different first/second rendering: 2

Image differences available at: http://54.193.163.58:8877/11532ae0b0fcb47/reftest-analyzer.html#web=eq.log

timvandermeij · 2022-01-01T13:43:25Z

Looks good!

Snuffleupagus added the core label Dec 31, 2021

Snuffleupagus force-pushed the getAllPageDicts-async branch from 256207e to b0e774d Compare December 31, 2021 21:09

mozilla deleted a comment from pdfjsbot Dec 31, 2021

timvandermeij approved these changes Jan 1, 2022


timvandermeij merged commit f287c5f into mozilla:master Jan 1, 2022

Snuffleupagus deleted the getAllPageDicts-async branch January 1, 2022 14:36

Snuffleupagus mentioned this pull request Feb 16, 2022

Get "unnecessary" range on first page #14570

Closed
