Convert Catalog.getAllPageDicts to an async method #14411

Merged on Jan 1, 2022 (1 commit)

Snuffleupagus
Collaborator

The patch in PR #14335 essentially re-introduced the old code from before PR #3848. However, looking at this code a bit closer, it should be possible to simplify it by making the method asynchronous.

While this method is currently only used as a fallback in corrupt documents, the way that MissingDataExceptions are handled is less than ideal. Note that if a MissingDataException is thrown, we're forced to re-parse the entire /Pages tree[1].
With this method now being asynchronous, we can fetch References far more simply than before, without throwing MissingDataExceptions or re-parsing anything.
These changes also let us simplify the call-site slightly, by calling the method directly instead of going through the PDFManager-instance (since, again, it no longer throws MissingDataExceptions).
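
To illustrate the difference, here is a minimal sketch, not the actual PDF.js code. `FakeXRef`, `MissingDataError`, `walkWithRetry`, `walkAsync` and `buildXRef` are all hypothetical stand-ins invented for this example; the tree is assumed to be flat, and "loading" missing data is simulated rather than fetched from a network.

```javascript
// Hypothetical stand-ins for illustration; this is not the PDF.js source.
class MissingDataError extends Error {
  constructor(id) {
    super(`Missing data for ${id}`);
    this.id = id;
  }
}

class FakeXRef {
  constructor(objects, missingIds) {
    this.objects = objects; // Map: id -> { kids: [...] } or { page: n }
    this.missing = new Set(missingIds); // ids whose data is not loaded yet
    this.fetchCount = 0; // total number of objects fetched/parsed
  }

  // Synchronous fetch: throws when the underlying data is unavailable.
  fetch(id) {
    this.fetchCount++;
    if (this.missing.has(id)) {
      throw new MissingDataError(id);
    }
    return this.objects.get(id);
  }

  // Simulates requesting the missing data (e.g. over the network).
  async load(id) {
    this.missing.delete(id);
  }

  // Asynchronous fetch: waits for the data instead of throwing.
  async fetchAsync(id) {
    if (this.missing.has(id)) {
      await this.load(id);
    }
    return this.fetch(id);
  }
}

// Old style: a synchronous walk, restarted from the root after each miss.
async function walkWithRetry(xref, rootId) {
  while (true) {
    try {
      const pages = [];
      const queue = [rootId];
      while (queue.length > 0) {
        const node = xref.fetch(queue.shift());
        if (node.kids) {
          queue.push(...node.kids);
        } else {
          pages.push(node.page);
        }
      }
      return pages;
    } catch (ex) {
      if (!(ex instanceof MissingDataError)) {
        throw ex;
      }
      await xref.load(ex.id); // Then start over from the root.
    }
  }
}

// New style: await each fetch in place, so nothing is parsed twice.
async function walkAsync(xref, rootId) {
  const pages = [];
  const queue = [rootId];
  while (queue.length > 0) {
    const node = await xref.fetchAsync(queue.shift());
    if (node.kids) {
      queue.push(...node.kids);
    } else {
      pages.push(node.page);
    }
  }
  return pages;
}

// A flat 1000-page tree where the data for page 900 is initially missing.
function buildXRef() {
  const objects = new Map([["root", { kids: [] }]]);
  for (let i = 1; i <= 1000; i++) {
    objects.get("root").kids.push(`p${i}`);
    objects.set(`p${i}`, { page: i });
  }
  return new FakeXRef(objects, ["p900"]);
}
```

Comparing `fetchCount` after running each walk on a fresh `buildXRef()` shows the cost: `walkWithRetry` fetches every object before page 900 twice, whereas `walkAsync` fetches each object once.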

Furthermore, this patch contains the following other changes:

  • Reduce unnecessary duplication in the various catch handlers throughout the method, by simply moving the XRefEntryException handling into the addPageError helper function instead.
  • Move the "circular references"-check to occur slightly earlier, since there's obviously no point in asynchronously fetching data just to then throw an Error immediately afterwards.
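
The two changes above can be sketched roughly as follows. This is an illustrative approximation only, not the code from the patch: `collectPageDicts`, `XRefEntryError`, `FormatError`, the `recoveryMode` flag and the `[error, dict]` result shape are hypothetical names and assumptions made for the example. In particular, it assumes that the XRefEntryException handling amounts to re-throwing outside of a recovery mode, and it records the circular-reference error rather than throwing it.

```javascript
// Hypothetical stand-ins for illustration; this is not the PDF.js source.
class XRefEntryError extends Error {}
class FormatError extends Error {}

async function collectPageDicts(fetchAsync, rootRef, recoveryMode = false) {
  const results = []; // Each entry is [error, dict]; exactly one is null.
  const visited = new Set();
  const queue = [rootRef];

  // Single place that decides how an error is treated, so the individual
  // catch handlers below do not need to repeat the XRefEntryError check.
  function addPageError(error) {
    if (error instanceof XRefEntryError && !recoveryMode) {
      throw error;
    }
    results.push([error, null]);
  }

  while (queue.length > 0) {
    const ref = queue.shift();

    // Check for circular references *before* fetching, so no asynchronous
    // work is started for a node that would be rejected anyway.
    if (visited.has(ref)) {
      addPageError(new FormatError("Pages tree contains circular reference."));
      continue;
    }
    visited.add(ref);

    let node;
    try {
      node = await fetchAsync(ref);
    } catch (ex) {
      addPageError(ex);
      continue;
    }

    if (Array.isArray(node.kids)) {
      queue.push(...node.kids);
    } else {
      results.push([null, node]);
    }
  }
  return results;
}
```

Because the visited-check runs first, `fetchAsync` is never invoked for a reference that has already been seen.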

[1] Imagine, e.g., a thousand-page document where a MissingDataException is thrown when fetching/parsing page 900.

@mozilla mozilla deleted a comment from pdfjsbot Dec 31, 2021
@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/7119004af3fdacc/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/03ce5319099314d/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/7119004af3fdacc/output.txt

Total script time: 22.35 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 12
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/7119004af3fdacc/reftest-analyzer.html#web=eq.log

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/03ce5319099314d/output.txt

Total script time: 30.14 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  errors: 649
  different ref/snapshot: 8
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/03ce5319099314d/reftest-analyzer.html#web=eq.log

@Snuffleupagus
Collaborator Author

/botio-windows test

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/11532ae0b0fcb47/output.txt

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/11532ae0b0fcb47/output.txt

Total script time: 42.47 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 7
  different first/second rendering: 2

Image differences available at: http://54.193.163.58:8877/11532ae0b0fcb47/reftest-analyzer.html#web=eq.log

@timvandermeij timvandermeij merged commit f287c5f into mozilla:master Jan 1, 2022
@timvandermeij
Contributor

Looks good!

@Snuffleupagus Snuffleupagus deleted the getAllPageDicts-async branch January 1, 2022 14:36