Make PDF anchoring robust to minor differences between extracted text and text layer #4331

robertknight · 2022-03-18T14:11:33Z

Anchoring in PDFs currently relies on the text extracted from the PDF via PDF.js's PDFPage.getTextContent API exactly matching the textContent property of the hidden text layer element that PDF.js creates. This complicates PDF.js upgrades because we have several times seen minor differences in output (e.g. in which whitespace is included). See hypothesis/browser-extension#799 (comment) for an example.

I think it probably makes sense to instead make the client's anchoring robust to such differences, even though that will make anchoring a bit more expensive.

To do this we will need to rework the anchorByPosition function in src/annotator/anchoring/pdf.js to remove the assumption that text positions in the extracted text exactly match those in the text layer, or at least the assumption that whitespace matches exactly.
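As a rough illustration of the idea, here is a minimal sketch of translating offsets between two strings that differ only in whitespace. The names `translateOffset` and `translateRange` are hypothetical and are not taken from the client's code. The sketch assumes both strings contain the same non-whitespace characters in the same order, and it does not handle any other kind of difference.

```typescript
// Hypothetical helpers for illustration; not the Hypothesis client's actual code.

/** Return true if `ch` is a whitespace character. */
function isSpace(ch: string): boolean {
  return /\s/.test(ch);
}

/**
 * Map `offset` in `source` to an offset in `target`, ignoring whitespace.
 *
 * Assumes `source` and `target` contain the same non-whitespace characters
 * in the same order. If `skipSpace` is true, the result is moved forward
 * past any whitespace so that it points at the next non-whitespace character.
 */
function translateOffset(
  source: string,
  target: string,
  offset: number,
  skipSpace: boolean
): number {
  // Count the non-whitespace characters that precede `offset` in `source`.
  let remaining = 0;
  const limit = Math.min(offset, source.length);
  for (let i = 0; i < limit; i++) {
    if (!isSpace(source.charAt(i))) {
      remaining++;
    }
  }

  // Advance through `target` until the same number has been passed.
  let pos = 0;
  while (pos < target.length && remaining > 0) {
    if (!isSpace(target.charAt(pos))) {
      remaining--;
    }
    pos++;
  }

  if (skipSpace) {
    while (pos < target.length && isSpace(target.charAt(pos))) {
      pos++;
    }
  }
  return pos;
}

/** Translate a [start, end) range in `source` to a range in `target`. */
function translateRange(
  source: string,
  target: string,
  start: number,
  end: number
): { start: number; end: number } {
  // The start skips leading whitespace; the end stops right after the
  // last non-whitespace character, so the range excludes padding.
  const newStart = translateOffset(source, target, start, true);
  const newEnd = translateOffset(source, target, end, false);
  return { start: newStart, end: Math.max(newStart, newEnd) };
}
```

For example, `translateRange("Hello world", "Hello  world ", 6, 11)` returns `{ start: 7, end: 12 }`, which selects "world" in the second string despite the extra spaces. Each translation scans the text from the beginning, which is the kind of extra cost mentioned above.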

This was referenced Mar 18, 2022

Update PDF.js to v2.14.137 hypothesis/browser-extension#799

Merged

Last line of PDF text on a page not selectable in the new PDF.js hypothesis/product-backlog#1283

Closed

robertknight mentioned this issue Mar 28, 2022

Allow for whitespace differences between page text and text layer #4355

Merged

robertknight closed this as completed in #4355 Apr 6, 2022
