Update PDF.js to v2.14.137 #799

robertknight · 2022-02-17T10:34:20Z

The first commit in this PR updates the tools/update-pdfjs script to document and standardize how PDF.js updates are committed. The second commit updates PDF.js to the latest release.

In testing this build I found that there have been some whitespace changes in the text layer in some cases (see comments below) and also discrepancies between whitespace that the client extracts from a PDF page by calling PDF.js APIs vs what is in the text layer. The client was already robust to whitespace changes between PDF.js releases when anchoring text quotes, and the most recent updates also make it robust to differences in whitespace between the text we extract via PDF.js APIs vs. the whitespace that PDF.js includes in the text layer. See hypothesis/client#4355.

Once the latest version of PDF.js is rolled out to the browser extension, Via and other apps, we should be able to close hypothesis/product-backlog#1283.

robertknight · 2022-02-17T12:05:29Z

Investigate "PDF text layer content does not match page text. This will cause anchoring misalignment." warning in console

Comparing the two strings (here split into single-char lines) whose non-equality produced this warning, it looks like there are a few spaces present in the text returned by the PDF.js text extraction APIs which are not present in the textContent of the text layer.

The difference can also be seen by comparing PDF.js's search (which uses the text extraction API) vs the result of copying and pasting text from the document (which uses the textContent of the text layer). Search for the string "lp man" from the first difference below for example, and PDF.js will find a match. If you copy and paste the highlighted text, a space is removed.

--- pdfjs-api-text-lines        2022-02-17 11:59:04.000000000 +0000
+++ pdfjs-layer-text-lines      2022-02-17 11:59:13.000000000 +0000
@@ -808,7 +808,6 @@

 l
 p
-
 m
 a
 n
@@ -1067,7 +1066,6 @@
 e
 l
 l
-
 a
 s

@@ -1146,7 +1144,6 @@

 t
 o
-
 s
 m
 a
@@ -1409,7 +1406,6 @@
 t
 e
 n
-
 t
 h
 e

robertknight · 2022-02-17T12:14:20Z

As a general note, to get a list of changes between two updates, you can search for closed PRs since the previous PDF.js update. Changes from v2.11.213 to v2.13.153. Handily they have a label for changes that relate to text selection.

robertknight · 2022-02-18T15:52:28Z

Investigate "PDF text layer content does not match page text. This will cause anchoring misalignment." warning in console

Test case for this issue extracted using qpdf: ator-8.pdf (extracted from https://link.springer.com/content/pdf/10.1007%2F978-3-319-73990-8.pdf).

Search for the string "lp man" in this text and it will highlight text that is part of the phrase "help manage". If you now copy and paste the text, the space between "lp" and "man" disappears.

Performing the same test in Safari and Chrome's native PDF viewers, they vary in the spacing that is generated for this pair of words, but the results of searching the document and copy + paste are consistent within the same viewer.

git bisect finds mozilla/pdf.js@6cdae5a as the commit that introduced the issue.

robertknight · 2022-03-18T14:06:10Z

Regarding this item:

Investigate "PDF text layer content does not match page text. This will cause anchoring misalignment." warning in console (See #799 (comment) for analysis)

I think it would probably make sense to make the client's anchoring robust to minor (eg. whitespace-only) differences in the text layer in a PDF vs. the extracted content. The reason it currently isn't is because the anchorByPosition function in src/annotator/anchoring/pdf.js assumes that text positions calculated based on text extracted via the PDF.js library will match up with the textContent property of the text layer element. If we could remove that assumption, or weaken it to only require the non-whitespace content to match, this would avoid this problem.

Update: I filed an issue for this - hypothesis/client#4331

robertknight · 2022-03-27T07:21:08Z

Looking at one of the mismatches from #799 (comment), around the phrase "can help manage" from ator-8.pdf, the segmentation of the text into text items has changed:

Previous PDF.js version:

New PDF.js version:

In the new PDF.js release the space after "lp" has been moved into a separate whitespace-only text item. Looking at the DOM, I notice that in each case where a space is missing in the text layer compared to the text items array, the missing space is at a <span class="markedContent"> boundary in the text layer:

Notice there that "lp" is inside a markedContent block and "manage" is outside that block.

Digging around the code I notice that PDF.js passes an includeMarkedContent option when it fetches the text items that it uses to render the text layer, whereas we do not. When this option is specified, output structure changes to match the text layer:

Note the "Suspect" tag associated with the marked content. The text layer in the PDF is hidden text that has been added by OCR software (despite the source clearly being a digital document 🤦 ), and I guess these tags have been added around some words to indicate possible OCR errors.

robertknight · 2022-03-27T08:01:05Z

A possible solution to the immediate problem is to ask PDF.js to include marked content items in the text content, so that what we receive matches what PDF.js uses when rendering the text layer:

diff --git a/src/annotator/anchoring/pdf.js b/src/annotator/anchoring/pdf.js
index a6cb1003f..0f2795cad 100644
--- a/src/annotator/anchoring/pdf.js
+++ b/src/annotator/anchoring/pdf.js
@@ -187,6 +187,12 @@ function getPageTextContent(pageIndex) {
   const getPageText = async () => {
     const pageView = await getPageView(pageIndex);
     const textContent = await pageView.pdfPage.getTextContent({
+      // Include marked content marker items in the output. We ignore these,
+      // but set this option because we have seen that it can affect whether
+      // certain whitespace is included in the output, and we want the text
+      // content extracted here to match the text layer as closely as possible.
+      includeMarkedContent: true,
+
       normalizeWhitespace: true,
     });
     let items = textContent.items;
@@ -197,9 +203,18 @@ function getPageTextContent(pageIndex) {
     // introduced the `hasEOL` property to text items, so we use the absence
     // of this property to determine if we need to filter out whitespace-only strings.
     const excludeEmpty = items.length > 0 && !('hasEOL' in items[0]);
-    if (excludeEmpty) {
-      items = items.filter(it => /\S/.test(it.str));
-    }
+
+    items = items.filter(item => {
+      if (item.str === undefined) {
+        // Filter out marked content item.
+        return false;
+      }
+      if (excludeEmpty && !/\S/.test(item.str)) {
+        // Filter out all-whitespace string.
+        return false;
+      }
+      return true;
+    });
 
     return items.map(it => it.str).join('');
   };
diff --git a/src/types/pdfjs.js b/src/types/pdfjs.js
index 4d0a9dfdd..4de7cc13f 100644
--- a/src/types/pdfjs.js
+++ b/src/types/pdfjs.js
@@ -63,6 +63,7 @@
 
 /**
  * @typedef GetTextContentParameters
+ * @prop {boolean} includeMarkedContent
  * @prop {boolean} normalizeWhitespace
  */

I think it would still be worth us looking into making the client robust to differences in spaces between the rendered text layer and what we extract from PDF.js though, as this will make future PDF.js updates easier.

When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which translates a (start, end) pair of offsets in one string into offsets in another, ignoring certain characters in the strings. - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)

robertknight · 2022-03-28T09:07:52Z

hypothesis/client#4355 handles changes in spaces in the PDF.js API and text layer in a general way, reducing how often we need to tweak the client's text extraction during PDF.js updates.

When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which translates a (start, end) pair of offsets in one string into offsets in another, ignoring certain characters in the strings. - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)

When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which maps a (start, end) pair of offsets in an input string into corresponding offsets in an output string, where the output is a version of the input that has been "corrupted" by the addition or removal of certain characters (eg. whitespace) - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)

robertknight · 2022-03-30T14:37:56Z

Investigate why text spacing in the quote has changed compared to the previous PDF.js version when annotating the paragraph beginning "Access management is a very complicated beast" on page 2 of [this PDF](https://link.springer.com/content/pdf/10.1007%2F978-3-319-73990-8.pdf

Text with previous PDF.js build:

"Access management is a very comp licated beast", concluded one of my customers at the end of a lengthy support call. This might indeed renect how many librarians feel these days but it doesn't need to be! After reading this book, you will be able to skillfully navigate the maze of online access management technologies and decide what serves your library's needs best

Text with new build:

"Access management is a v ery comp licated beast",concluded one ofmy customers at the end ofa lengthy supportcall.This might indeed renect how many librariansfeelthese days but itdoesn't need to be! Afterreading thisbook, you willbe able to skillfullynavigate the maze ofonline access management technologies anddecide what serves your library'sneeds best

Some spaces have gone missing.

When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which maps a (start, end) pair of offsets in an input string into corresponding offsets in an output string, where the output is a version of the input that has been "corrupted" by the addition or removal of certain characters (eg. whitespace) - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)

robertknight · 2022-03-30T14:58:48Z

I have updated again to the latest PDF.js build (2.14.119) and done some testing with recently annotated PDFs in the Public group, using a client build that includes hypothesis/client#4355.

When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which maps a (start, end) pair of offsets in an input string into corresponding offsets in an output string, where the output is a version of the input that has been "corrupted" by the addition or removal of certain characters (eg. whitespace) - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)

This serves to standardize how PDF.js updates are committed in the repository.

robertknight · 2022-04-06T08:59:06Z

This branch has been updated to include a client build with hypothesis/client#4355, which now means that we can roll out this PDF.js update.

Update PDF.js using ./tools/update-pdfjs.

robertknight · 2022-04-07T08:01:51Z

Updated again to v2.14.137 to match the PRs in other repos.

lyzadanger · 2022-04-07T13:01:14Z

tools/update-pdfjs

+# Check for uncommitted git changes. See https://stackoverflow.com/a/3879077/434243
+git update-index --refresh
+git diff-index --quiet HEAD --
+if [[ $? -ne 0 ]]; then
+  echo "Cannot update PDF.js when there are uncommitted changes in working tree."
+  exit 1
+fi
+


Can you help me understand why this protection was added to this project's script but not the others?

Gah. I hadn't actually intended to commit this in this PR, although it would be useful to have. This is an example of why it would be good to avoid duplicating this script across repositories: hypothesis/client#4397.

Heh. "The safety net is working."

lyzadanger

I had a question about the uncommitted-files check in this project's upgrade-pdfjs script, but otherwise AFAICT this looks good. Was there any testing you wanted me to do? It sounds like this project/branch is where you've done the bulk of the testing already?

robertknight · 2022-04-07T13:18:28Z

Was there any testing you wanted me to do? It sounds like this project/branch is where you've done the bulk of the testing already?

This project is where I did most of my manual testing. My plan is to roll out the new PDF.js release in the extension first, let it bake for a few days, and then roll it out to Via.

This was referenced Mar 18, 2022

Make PDF anchoring robust to minor differences between extracted text and text layer hypothesis/client#4331

Closed

Last line of PDF text on a page not selectable in the new PDF.js hypothesis/product-backlog#1283

Closed

robertknight mentioned this pull request Mar 28, 2022

Allow for whitespace differences between page text and text layer hypothesis/client#4355

Merged

robertknight force-pushed the update-pdfjs-feb-2022 branch from 9fb7a65 to 44a28c4 Compare March 30, 2022 14:58

robertknight marked this pull request as ready for review March 30, 2022 14:59

Commit updated PDF.js build as part of tools/update-pdfjs

a63086b

This serves to standardize how PDF.js updates are committed in the repository.

robertknight force-pushed the update-pdfjs-feb-2022 branch from 44a28c4 to 81162e6 Compare April 6, 2022 08:43

robertknight changed the title ~~Update PDF.js to v2.13.153~~ Update PDF.js to v2.14.119 Apr 6, 2022

This was referenced Apr 6, 2022

Update PDF.js to v2.14.137 and improve automation/documentation of process hypothesis/via#720

Merged

Update PDF.js to v2.14.137 hypothesis/pdf.js-hypothes.is#26

Merged

Update PDF.js to v2.14.137

e084d85

Update PDF.js using ./tools/update-pdfjs.

robertknight force-pushed the update-pdfjs-feb-2022 branch from 81162e6 to e084d85 Compare April 7, 2022 07:59

robertknight changed the title ~~Update PDF.js to v2.14.119~~ Update PDF.js to v2.14.137 Apr 7, 2022

robertknight mentioned this pull request Apr 7, 2022

Update PDF.js to v2.14.137 hypothesis/client#4396

Merged

lyzadanger reviewed Apr 7, 2022

View reviewed changes

lyzadanger approved these changes Apr 7, 2022

View reviewed changes

robertknight merged commit b729bdc into master Apr 7, 2022

robertknight deleted the update-pdfjs-feb-2022 branch April 7, 2022 13:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update PDF.js to v2.14.137 #799

Update PDF.js to v2.14.137 #799

robertknight commented Feb 17, 2022 •

edited

Loading

robertknight commented Feb 17, 2022

robertknight commented Feb 17, 2022 •

edited

Loading

robertknight commented Feb 18, 2022 •

edited

Loading

robertknight commented Mar 18, 2022 •

edited

Loading

robertknight commented Mar 27, 2022 •

edited

Loading

robertknight commented Mar 27, 2022

robertknight commented Mar 28, 2022

robertknight commented Mar 30, 2022

robertknight commented Mar 30, 2022 •

edited

Loading

robertknight commented Apr 6, 2022

robertknight commented Apr 7, 2022

lyzadanger Apr 7, 2022

robertknight Apr 7, 2022

lyzadanger Apr 7, 2022

lyzadanger left a comment

robertknight commented Apr 7, 2022

Update PDF.js to v2.14.137 #799

Update PDF.js to v2.14.137 #799

Conversation

robertknight commented Feb 17, 2022 • edited Loading

robertknight commented Feb 17, 2022

robertknight commented Feb 17, 2022 • edited Loading

robertknight commented Feb 18, 2022 • edited Loading

robertknight commented Mar 18, 2022 • edited Loading

robertknight commented Mar 27, 2022 • edited Loading

robertknight commented Mar 27, 2022

robertknight commented Mar 28, 2022

robertknight commented Mar 30, 2022

robertknight commented Mar 30, 2022 • edited Loading

robertknight commented Apr 6, 2022

robertknight commented Apr 7, 2022

lyzadanger Apr 7, 2022

Choose a reason for hiding this comment

robertknight Apr 7, 2022

Choose a reason for hiding this comment

lyzadanger Apr 7, 2022

Choose a reason for hiding this comment

lyzadanger left a comment

Choose a reason for hiding this comment

robertknight commented Apr 7, 2022

robertknight commented Feb 17, 2022 •

edited

Loading

robertknight commented Feb 17, 2022 •

edited

Loading

robertknight commented Feb 18, 2022 •

edited

Loading

robertknight commented Mar 18, 2022 •

edited

Loading

robertknight commented Mar 27, 2022 •

edited

Loading

robertknight commented Mar 30, 2022 •

edited

Loading