-
Notifications
You must be signed in to change notification settings - Fork 134
Update PDF.js to v2.14.137 #799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Comparing the two strings (here split into single-char lines) whose non-equality produced this warning, it looks like there are a few spaces present in the text returned by the PDF.js text extraction APIs which are not present in the The difference can also be seen by comparing PDF.js's search (which uses the text extraction API) vs the result of copying and pasting text from the document (which uses the
|
As a general note, to get a list of changes between two updates, you can search for closed PRs since the previous PDF.js update. Changes from v2.11.213 to v2.13.153. Handily they have a label for changes that relate to text selection. |
Test case for this issue extracted using qpdf: ator-8.pdf (extracted from https://link.springer.com/content/pdf/10.1007%2F978-3-319-73990-8.pdf). Search for the string "lp man" in this text and it will highlight text that is part of the phrase "help manage". If you now copy and paste the text, the space between "lp" and "man" disappears. Performing the same test in Safari and Chrome's native PDF viewers, they vary in the spacing that is generated for this pair of words, but the results of searching the document and copy + paste are consistent within the same viewer.
|
Regarding this item:
I think it would probably make sense to make the client's anchoring robust to minor (eg. whitespace-only) differences in the text layer in a PDF vs. the extracted content. The reason it currently isn't is because the Update: I filed an issue for this - hypothesis/client#4331 |
Looking at one of the mismatches from #799 (comment), around the phrase "can help manage" from ator-8.pdf, the segmentation of the text into text items has changed: Previous PDF.js version: New PDF.js version: In the new PDF.js release the space after "lp" has been moved into a separate whitespace-only text item. Looking at the DOM, I notice that in each case where a space is missing in the text layer compared to the text items array, the missing space is at a Notice there that "lp" is inside a Digging around the code I notice that PDF.js passes an Note the "Suspect" tag associated with the marked content. The text layer in the PDF is hidden text that has been added by OCR software (despite the source clearly being a digital document 🤦 ), and I guess these tags have been added around some words to indicate possible OCR errors. |
A possible solution to the immediate problem is to ask PDF.js to include marked content items in the text content, so that what we receive matches what PDF.js uses when rendering the text layer: diff --git a/src/annotator/anchoring/pdf.js b/src/annotator/anchoring/pdf.js
index a6cb1003f..0f2795cad 100644
--- a/src/annotator/anchoring/pdf.js
+++ b/src/annotator/anchoring/pdf.js
@@ -187,6 +187,12 @@ function getPageTextContent(pageIndex) {
const getPageText = async () => {
const pageView = await getPageView(pageIndex);
const textContent = await pageView.pdfPage.getTextContent({
+ // Include marked content marker items in the output. We ignore these,
+ // but set this option because we have seen that it can affect whether
+ // certain whitespace is included in the output, and we want the text
+ // content extracted here to match the text layer as closely as possible.
+ includeMarkedContent: true,
+
normalizeWhitespace: true,
});
let items = textContent.items;
@@ -197,9 +203,18 @@ function getPageTextContent(pageIndex) {
// introduced the `hasEOL` property to text items, so we use the absence
// of this property to determine if we need to filter out whitespace-only strings.
const excludeEmpty = items.length > 0 && !('hasEOL' in items[0]);
- if (excludeEmpty) {
- items = items.filter(it => /\S/.test(it.str));
- }
+
+ items = items.filter(item => {
+ if (item.str === undefined) {
+ // Filter out marked content item.
+ return false;
+ }
+ if (excludeEmpty && !/\S/.test(item.str)) {
+ // Filter out all-whitespace string.
+ return false;
+ }
+ return true;
+ });
return items.map(it => it.str).join('');
};
diff --git a/src/types/pdfjs.js b/src/types/pdfjs.js
index 4d0a9dfdd..4de7cc13f 100644
--- a/src/types/pdfjs.js
+++ b/src/types/pdfjs.js
@@ -63,6 +63,7 @@
/**
* @typedef GetTextContentParameters
+ * @prop {boolean} includeMarkedContent
* @prop {boolean} normalizeWhitespace
*/ I think it would still be worth us looking into making the client robust to differences in spaces between the rendered text layer and what we extract from PDF.js though, as this will make future PDF.js updates easier. |
When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which translates a (start, end) pair of offsets in one string into offsets in another, ignoring certain characters in the strings. - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)
When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which translates a (start, end) pair of offsets in one string into offsets in another, ignoring certain characters in the strings. - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)
hypothesis/client#4355 handles changes in spaces in the PDF.js API and text layer in a general way, reducing how often we need to tweak the client's text extraction during PDF.js updates. |
When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which translates a (start, end) pair of offsets in one string into offsets in another, ignoring certain characters in the strings. - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)
When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which translates a (start, end) pair of offsets in one string into offsets in another, ignoring certain characters in the strings. - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)
When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which maps a (start, end) pair of offsets in an input string into corresponding offsets in an output string, where the output is a version of the input that has been "corrupted" by the addition or removal of certain characters (eg. whitespace) - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)
Text with previous PDF.js build:
Text with new build:
Some spaces have gone missing. |
When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which maps a (start, end) pair of offsets in an input string into corresponding offsets in an output string, where the output is a version of the input that has been "corrupted" by the addition or removal of certain characters (eg. whitespace) - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)
9fb7a65
to
44a28c4
Compare
I have updated again to the latest PDF.js build (2.14.119) and done some testing with recently annotated PDFs in the Public group, using a client build that includes hypothesis/client#4355. |
When anchoring a quote in a PDF, the quote is first searched in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page. An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespaces are included in the output nor not [1]. Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences. - Add `translateOffsets` utility, which maps a (start, end) pair of offsets in an input string into corresponding offsets in an output string, where the output is a version of the input that has been "corrupted" by the addition or removal of certain characters (eg. whitespace) - Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element. [1] hypothesis/browser-extension#799 (comment)
This serves to standardize how PDF.js updates are committed in the repository.
44a28c4
to
81162e6
Compare
This branch has been updated to include a client build with hypothesis/client#4355, which now means that we can roll out this PDF.js update. |
Update PDF.js using ./tools/update-pdfjs.
81162e6
to
e084d85
Compare
Updated again to v2.14.137 to match the PRs in other repos. |
# Check for uncommitted git changes. See https://stackoverflow.com/a/3879077/434243 | ||
git update-index --refresh | ||
git diff-index --quiet HEAD -- | ||
if [[ $? -ne 0 ]]; then | ||
echo "Cannot update PDF.js when there are uncommitted changes in working tree." | ||
exit 1 | ||
fi | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me understand why this protection was added to this project's script but not the others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah. I hadn't actually intended to commit this in this PR, although it would be useful to have. This is an example of why it would be good to avoid duplicating this script across repositories: hypothesis/client#4397.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh. "The safety net is working."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a question about the uncommitted-files check in this project's upgrade-pdfjs script, but otherwise AFAICT this looks good. Was there any testing you wanted me to do? It sounds like this project/branch is where you've done the bulk of the testing already?
This project is where I did most of my manual testing. My plan is to roll out the new PDF.js release in the extension first, let it bake for a few days, and then roll it out to Via. |
The first commit in this PR updates the
tools/update-pdfjs
script to document and standardize how PDF.js updates are committed. The second commit updates PDF.js to the latest release.In testing this build I found that there have been some whitespace changes in the text layer in some cases (see comments below) and also discrepancies between whitespace that the client extracts from a PDF page by calling PDF.js APIs vs what is in the text layer. The client was already robust to whitespace changes between PDF.js releases when anchoring text quotes, and the most recent updates also make it robust to differences in whitespace between the text we extract via PDF.js APIs vs. the whitespace that PDF.js includes in the text layer. See hypothesis/client#4355.
Once the latest version of PDF.js is rolled out to the browser extension, Via and other apps, we should be able to close hypothesis/product-backlog#1283.