Skip to content

Update PDF.js to v2.14.137 #799

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 7, 2022
Merged

Update PDF.js to v2.14.137 #799

merged 2 commits into from
Apr 7, 2022

Conversation

robertknight
Copy link
Member

@robertknight robertknight commented Feb 17, 2022

The first commit in this PR updates the tools/update-pdfjs script to document and standardize how PDF.js updates are committed. The second commit updates PDF.js to the latest release.

In testing this build I found that there have been some whitespace changes in the text layer in some cases (see comments below) and also discrepancies between whitespace that the client extracts from a PDF page by calling PDF.js APIs vs what is in the text layer. The client was already robust to whitespace changes between PDF.js releases when anchoring text quotes, and the most recent updates also make it robust to differences in whitespace between the text we extract via PDF.js APIs vs. the whitespace that PDF.js includes in the text layer. See hypothesis/client#4355.

Once the latest version of PDF.js is rolled out to the browser extension, Via and other apps, we should be able to close hypothesis/product-backlog#1283.

@robertknight
Copy link
Member Author

Investigate "PDF text layer content does not match page text. This will cause anchoring misalignment." warning in console

Comparing the two strings (here split into single-char lines) whose non-equality produced this warning, it looks like there are a few spaces present in the text returned by the PDF.js text extraction APIs which are not present in the textContent of the text layer.

The difference can also be seen by comparing PDF.js's search (which uses the text extraction API) vs the result of copying and pasting text from the document (which uses the textContent of the text layer). Search for the string "lp man" from the first difference below for example, and PDF.js will find a match. If you copy and paste the highlighted text, a space is removed.

--- pdfjs-api-text-lines        2022-02-17 11:59:04.000000000 +0000
+++ pdfjs-layer-text-lines      2022-02-17 11:59:13.000000000 +0000
@@ -808,7 +808,6 @@

 l
 p
-
 m
 a
 n
@@ -1067,7 +1066,6 @@
 e
 l
 l
-
 a
 s

@@ -1146,7 +1144,6 @@

 t
 o
-
 s
 m
 a
@@ -1409,7 +1406,6 @@
 t
 e
 n
-
 t
 h
 e

@robertknight
Copy link
Member Author

robertknight commented Feb 17, 2022

As a general note, to get a list of changes between two updates, you can search for closed PRs since the previous PDF.js update. Changes from v2.11.213 to v2.13.153. Handily they have a label for changes that relate to text selection.

@robertknight
Copy link
Member Author

robertknight commented Feb 18, 2022

Investigate "PDF text layer content does not match page text. This will cause anchoring misalignment." warning in console

Test case for this issue extracted using qpdf: ator-8.pdf (extracted from https://link.springer.com/content/pdf/10.1007%2F978-3-319-73990-8.pdf).

Search for the string "lp man" in this text and it will highlight text that is part of the phrase "help manage". If you now copy and paste the text, the space between "lp" and "man" disappears.

Performing the same test in Safari and Chrome's native PDF viewers, they vary in the spacing that is generated for this pair of words, but the results of searching the document and copy + paste are consistent within the same viewer.

git bisect finds mozilla/pdf.js@6cdae5a as the commit that introduced the issue.

@robertknight
Copy link
Member Author

robertknight commented Mar 18, 2022

Regarding this item:

Investigate "PDF text layer content does not match page text. This will cause anchoring misalignment." warning in console (See #799 (comment) for analysis)

I think it would probably make sense to make the client's anchoring robust to minor (eg. whitespace-only) differences in the text layer in a PDF vs. the extracted content. The reason it currently isn't is because the anchorByPosition function in src/annotator/anchoring/pdf.js assumes that text positions calculated based on text extracted via the PDF.js library will match up with the textContent property of the text layer element. If we could remove that assumption, or weaken it to only require the non-whitespace content to match, this would avoid this problem.

Update: I filed an issue for this - hypothesis/client#4331

@robertknight
Copy link
Member Author

robertknight commented Mar 27, 2022

Looking at one of the mismatches from #799 (comment), around the phrase "can help manage" from ator-8.pdf, the segmentation of the text into text items has changed:

Previous PDF.js version:

PDFjs-old

New PDF.js version:

PDFjs-new

In the new PDF.js release the space after "lp" has been moved into a separate whitespace-only text item. Looking at the DOM, I notice that in each case where a space is missing in the text layer compared to the text items array, the missing space is at a <span class="markedContent"> boundary in the text layer:

Marked content block

Notice there that "lp" is inside a markedContent block and "manage" is outside that block.

Digging around the code I notice that PDF.js passes an includeMarkedContent option when it fetches the text items that it uses to render the text layer, whereas we do not. When this option is specified, output structure changes to match the text layer:

Marked content text items

Note the "Suspect" tag associated with the marked content. The text layer in the PDF is hidden text that has been added by OCR software (despite the source clearly being a digital document 🤦 ), and I guess these tags have been added around some words to indicate possible OCR errors.

@robertknight
Copy link
Member Author

A possible solution to the immediate problem is to ask PDF.js to include marked content items in the text content, so that what we receive matches what PDF.js uses when rendering the text layer:

diff --git a/src/annotator/anchoring/pdf.js b/src/annotator/anchoring/pdf.js
index a6cb1003f..0f2795cad 100644
--- a/src/annotator/anchoring/pdf.js
+++ b/src/annotator/anchoring/pdf.js
@@ -187,6 +187,12 @@ function getPageTextContent(pageIndex) {
   const getPageText = async () => {
     const pageView = await getPageView(pageIndex);
     const textContent = await pageView.pdfPage.getTextContent({
+      // Include marked content marker items in the output. We ignore these,
+      // but set this option because we have seen that it can affect whether
+      // certain whitespace is included in the output, and we want the text
+      // content extracted here to match the text layer as closely as possible.
+      includeMarkedContent: true,
+
       normalizeWhitespace: true,
     });
     let items = textContent.items;
@@ -197,9 +203,18 @@ function getPageTextContent(pageIndex) {
     // introduced the `hasEOL` property to text items, so we use the absence
     // of this property to determine if we need to filter out whitespace-only strings.
     const excludeEmpty = items.length > 0 && !('hasEOL' in items[0]);
-    if (excludeEmpty) {
-      items = items.filter(it => /\S/.test(it.str));
-    }
+
+    items = items.filter(item => {
+      if (item.str === undefined) {
+        // Filter out marked content item.
+        return false;
+      }
+      if (excludeEmpty && !/\S/.test(item.str)) {
+        // Filter out all-whitespace string.
+        return false;
+      }
+      return true;
+    });
 
     return items.map(it => it.str).join('');
   };
diff --git a/src/types/pdfjs.js b/src/types/pdfjs.js
index 4d0a9dfdd..4de7cc13f 100644
--- a/src/types/pdfjs.js
+++ b/src/types/pdfjs.js
@@ -63,6 +63,7 @@
 
 /**
  * @typedef GetTextContentParameters
+ * @prop {boolean} includeMarkedContent
  * @prop {boolean} normalizeWhitespace
  */

I think it would still be worth us looking into making the client robust to differences in spaces between the rendered text layer and what we extract from PDF.js though, as this will make future PDF.js updates easier.

robertknight added a commit to hypothesis/client that referenced this pull request Mar 28, 2022
When anchoring a quote in a PDF, the quote is first searched in text
extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting
positions are used to create a range within the hidden text layer of a
page.

An issue we've seen several times when doing PDF.js upgrades is minor
changes to which spaces are included in the text layer. In the past
we've adapted our text extraction to match the text layer each time.
This slows down the process of upgrading PDF.js and makes maintaining
compatibility with a range of PDF.js releases more difficult. In the
most recent update, an `includeMarkedContent` option was added to the
`getTextContent` API, and the presence of that option could affect
whether certain whitespaces are included in the output nor not [1].

Try to address this issue generally by mapping offsets from the page
text into offsets in the text layer in a way that ignores whitespace
differences.

 - Add `translateOffsets` utility, which translates a (start, end) pair
   of offsets in one string into offsets in another, ignoring certain
   characters in the strings.

 - Use `translateOffsets` utility in PDF anchoring to map quote offsets
   in the page text returned by `PDFPage.getTextContent` into offsets in
   the `textContent` of the text layer element.

[1] hypothesis/browser-extension#799 (comment)
robertknight added a commit to hypothesis/client that referenced this pull request Mar 28, 2022
When anchoring a quote in a PDF, the quote is first searched in text
extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting
positions are used to create a range within the hidden text layer of a
page.

An issue we've seen several times when doing PDF.js upgrades is minor
changes to which spaces are included in the text layer. In the past
we've adapted our text extraction to match the text layer each time.
This slows down the process of upgrading PDF.js and makes maintaining
compatibility with a range of PDF.js releases more difficult. In the
most recent update, an `includeMarkedContent` option was added to the
`getTextContent` API, and the presence of that option could affect
whether certain whitespaces are included in the output nor not [1].

Try to address this issue generally by mapping offsets from the page
text into offsets in the text layer in a way that ignores whitespace
differences.

 - Add `translateOffsets` utility, which translates a (start, end) pair
   of offsets in one string into offsets in another, ignoring certain
   characters in the strings.

 - Use `translateOffsets` utility in PDF anchoring to map quote offsets
   in the page text returned by `PDFPage.getTextContent` into offsets in
   the `textContent` of the text layer element.

[1] hypothesis/browser-extension#799 (comment)
@robertknight
Copy link
Member Author

hypothesis/client#4355 handles changes in spaces in the PDF.js API and text layer in a general way, reducing how often we need to tweak the client's text extraction during PDF.js updates.

robertknight added a commit to hypothesis/client that referenced this pull request Mar 30, 2022
When anchoring a quote in a PDF, the quote is first searched in text
extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting
positions are used to create a range within the hidden text layer of a
page.

An issue we've seen several times when doing PDF.js upgrades is minor
changes to which spaces are included in the text layer. In the past
we've adapted our text extraction to match the text layer each time.
This slows down the process of upgrading PDF.js and makes maintaining
compatibility with a range of PDF.js releases more difficult. In the
most recent update, an `includeMarkedContent` option was added to the
`getTextContent` API, and the presence of that option could affect
whether certain whitespaces are included in the output nor not [1].

Try to address this issue generally by mapping offsets from the page
text into offsets in the text layer in a way that ignores whitespace
differences.

 - Add `translateOffsets` utility, which translates a (start, end) pair
   of offsets in one string into offsets in another, ignoring certain
   characters in the strings.

 - Use `translateOffsets` utility in PDF anchoring to map quote offsets
   in the page text returned by `PDFPage.getTextContent` into offsets in
   the `textContent` of the text layer element.

[1] hypothesis/browser-extension#799 (comment)
robertknight added a commit to hypothesis/client that referenced this pull request Mar 30, 2022
When anchoring a quote in a PDF, the quote is first searched in text
extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting
positions are used to create a range within the hidden text layer of a
page.

An issue we've seen several times when doing PDF.js upgrades is minor
changes to which spaces are included in the text layer. In the past
we've adapted our text extraction to match the text layer each time.
This slows down the process of upgrading PDF.js and makes maintaining
compatibility with a range of PDF.js releases more difficult. In the
most recent update, an `includeMarkedContent` option was added to the
`getTextContent` API, and the presence of that option could affect
whether certain whitespaces are included in the output nor not [1].

Try to address this issue generally by mapping offsets from the page
text into offsets in the text layer in a way that ignores whitespace
differences.

 - Add `translateOffsets` utility, which translates a (start, end) pair
   of offsets in one string into offsets in another, ignoring certain
   characters in the strings.

 - Use `translateOffsets` utility in PDF anchoring to map quote offsets
   in the page text returned by `PDFPage.getTextContent` into offsets in
   the `textContent` of the text layer element.

[1] hypothesis/browser-extension#799 (comment)
robertknight added a commit to hypothesis/client that referenced this pull request Mar 30, 2022
When anchoring a quote in a PDF, the quote is first searched in text
extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting
positions are used to create a range within the hidden text layer of a
page.

An issue we've seen several times when doing PDF.js upgrades is minor
changes to which spaces are included in the text layer. In the past
we've adapted our text extraction to match the text layer each time.
This slows down the process of upgrading PDF.js and makes maintaining
compatibility with a range of PDF.js releases more difficult. In the
most recent update, an `includeMarkedContent` option was added to the
`getTextContent` API, and the presence of that option could affect
whether certain whitespaces are included in the output nor not [1].

Try to address this issue generally by mapping offsets from the page
text into offsets in the text layer in a way that ignores whitespace
differences.

 - Add `translateOffsets` utility, which maps a (start, end) pair
   of offsets in an input string into corresponding offsets in an output
   string, where the output is a version of the input that has been
   "corrupted" by the addition or removal of certain characters (eg.
   whitespace)

 - Use `translateOffsets` utility in PDF anchoring to map quote offsets
   in the page text returned by `PDFPage.getTextContent` into offsets in
   the `textContent` of the text layer element.

[1] hypothesis/browser-extension#799 (comment)
@robertknight
Copy link
Member Author

Investigate why text spacing in the quote has changed compared to the previous PDF.js version when annotating the paragraph beginning "Access management is a very complicated beast" on page 2 of [this PDF](https://link.springer.com/content/pdf/10.1007%2F978-3-319-73990-8.pdf

Text with previous PDF.js build:

"Access management is a very comp licated beast", concluded one of my customers at the end of a lengthy support call. This might indeed renect how many librarians feel these days but it doesn't need to be! After reading this book, you will be able to skillfully navigate the maze of online access management technologies and decide what serves your library's needs best

Text with new build:

"Access management is a v ery comp licated beast",concluded one ofmy customers at the end ofa lengthy supportcall.This might indeed renect how many librariansfeelthese days but itdoesn't need to be! Afterreading thisbook, you willbe able to skillfullynavigate the maze ofonline access management technologies anddecide what serves your library'sneeds best

Some spaces have gone missing.

robertknight added a commit to hypothesis/client that referenced this pull request Mar 30, 2022
When anchoring a quote in a PDF, the quote is first searched in text
extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting
positions are used to create a range within the hidden text layer of a
page.

An issue we've seen several times when doing PDF.js upgrades is minor
changes to which spaces are included in the text layer. In the past
we've adapted our text extraction to match the text layer each time.
This slows down the process of upgrading PDF.js and makes maintaining
compatibility with a range of PDF.js releases more difficult. In the
most recent update, an `includeMarkedContent` option was added to the
`getTextContent` API, and the presence of that option could affect
whether certain whitespaces are included in the output nor not [1].

Try to address this issue generally by mapping offsets from the page
text into offsets in the text layer in a way that ignores whitespace
differences.

 - Add `translateOffsets` utility, which maps a (start, end) pair
   of offsets in an input string into corresponding offsets in an output
   string, where the output is a version of the input that has been
   "corrupted" by the addition or removal of certain characters (eg.
   whitespace)

 - Use `translateOffsets` utility in PDF anchoring to map quote offsets
   in the page text returned by `PDFPage.getTextContent` into offsets in
   the `textContent` of the text layer element.

[1] hypothesis/browser-extension#799 (comment)
@robertknight robertknight force-pushed the update-pdfjs-feb-2022 branch from 9fb7a65 to 44a28c4 Compare March 30, 2022 14:58
@robertknight
Copy link
Member Author

robertknight commented Mar 30, 2022

I have updated again to the latest PDF.js build (2.14.119) and done some testing with recently annotated PDFs in the Public group, using a client build that includes hypothesis/client#4355.

@robertknight robertknight marked this pull request as ready for review March 30, 2022 14:59
robertknight added a commit to hypothesis/client that referenced this pull request Apr 6, 2022
When anchoring a quote in a PDF, the quote is first searched in text
extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting
positions are used to create a range within the hidden text layer of a
page.

An issue we've seen several times when doing PDF.js upgrades is minor
changes to which spaces are included in the text layer. In the past
we've adapted our text extraction to match the text layer each time.
This slows down the process of upgrading PDF.js and makes maintaining
compatibility with a range of PDF.js releases more difficult. In the
most recent update, an `includeMarkedContent` option was added to the
`getTextContent` API, and the presence of that option could affect
whether certain whitespaces are included in the output nor not [1].

Try to address this issue generally by mapping offsets from the page
text into offsets in the text layer in a way that ignores whitespace
differences.

 - Add `translateOffsets` utility, which maps a (start, end) pair
   of offsets in an input string into corresponding offsets in an output
   string, where the output is a version of the input that has been
   "corrupted" by the addition or removal of certain characters (eg.
   whitespace)

 - Use `translateOffsets` utility in PDF anchoring to map quote offsets
   in the page text returned by `PDFPage.getTextContent` into offsets in
   the `textContent` of the text layer element.

[1] hypothesis/browser-extension#799 (comment)
This serves to standardize how PDF.js updates are committed in the
repository.
@robertknight robertknight force-pushed the update-pdfjs-feb-2022 branch from 44a28c4 to 81162e6 Compare April 6, 2022 08:43
@robertknight
Copy link
Member Author

This branch has been updated to include a client build with hypothesis/client#4355, which now means that we can roll out this PDF.js update.

Update PDF.js using ./tools/update-pdfjs.
@robertknight robertknight force-pushed the update-pdfjs-feb-2022 branch from 81162e6 to e084d85 Compare April 7, 2022 07:59
@robertknight robertknight changed the title Update PDF.js to v2.14.119 Update PDF.js to v2.14.137 Apr 7, 2022
@robertknight
Copy link
Member Author

Updated again to v2.14.137 to match the PRs in other repos.

Comment on lines +20 to +27
# Check for uncommitted git changes. See https://stackoverflow.com/a/3879077/434243
git update-index --refresh
git diff-index --quiet HEAD --
if [[ $? -ne 0 ]]; then
echo "Cannot update PDF.js when there are uncommitted changes in working tree."
exit 1
fi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand why this protection was added to this project's script but not the others?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah. I hadn't actually intended to commit this in this PR, although it would be useful to have. This is an example of why it would be good to avoid duplicating this script across repositories: hypothesis/client#4397.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh. "The safety net is working."

Copy link
Contributor

@lyzadanger lyzadanger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a question about the uncommitted-files check in this project's upgrade-pdfjs script, but otherwise AFAICT this looks good. Was there any testing you wanted me to do? It sounds like this project/branch is where you've done the bulk of the testing already?

@robertknight
Copy link
Member Author

Was there any testing you wanted me to do? It sounds like this project/branch is where you've done the bulk of the testing already?

This project is where I did most of my manual testing. My plan is to roll out the new PDF.js release in the extension first, let it bake for a few days, and then roll it out to Via.

@robertknight robertknight merged commit b729bdc into master Apr 7, 2022
@robertknight robertknight deleted the update-pdfjs-feb-2022 branch April 7, 2022 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Last line of PDF text on a page not selectable in the new PDF.js
2 participants