Anchoring in PDFs currently relies on the text extracted from the PDF via PDF.js's PDFPage.getTextContent API exactly matching the textContent property of the hidden text layer element that PDF.js creates. This complicates PDF.js upgrades because we have several times seen minor differences in output (e.g. in which whitespace is included). See hypothesis/browser-extension#799 (comment) for an example.
I think it will probably make sense to instead make the client's anchoring robust to such differences, even though it will make anchoring a bit more expensive.
To do this we will need to rework the anchorByPosition function in src/annotator/anchoring/pdf.js to remove the assumption that text positions in the extracted text exactly match those in the text layer, or at least the assumption that whitespace matches exactly.
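As a rough illustration of what whitespace-tolerant position mapping could look like, here is a minimal sketch. `translateRange` is a hypothetical helper, not an existing function in the client or in PDF.js, and it assumes the extracted text and the text layer differ only in whitespace. It counts the non-whitespace characters before each offset in the extracted text and finds the matching position in the text layer.

```javascript
// Hypothetical helper (not part of the Hypothesis client or PDF.js).
// Assumption: `extractedText` and `layerText` contain the same
// non-whitespace characters in the same order, differing only in whitespace.

const isSpace = ch => /\s/.test(ch);

// Count the non-whitespace characters in `text` before `offset`.
function countNonSpace(text, offset) {
  let count = 0;
  const limit = Math.min(offset, text.length);
  for (let i = 0; i < limit; i++) {
    if (!isSpace(text[i])) {
      count++;
    }
  }
  return count;
}

// Return the position in `text` just after its `count`-th non-whitespace
// character (or 0 when `count` is 0).
function positionAfterNonSpace(text, count) {
  let pos = 0;
  let seen = 0;
  while (pos < text.length && seen < count) {
    if (!isSpace(text[pos])) {
      seen++;
    }
    pos++;
  }
  return pos;
}

// Map a [start, end) range in the extracted text to a range in the text layer.
function translateRange(extractedText, layerText, start, end) {
  let layerStart = positionAfterNonSpace(
    layerText,
    countNonSpace(extractedText, start)
  );
  // Move the start forward past whitespace so it lands on the first
  // character of the quote rather than on a preceding space.
  while (layerStart < layerText.length && isSpace(layerText[layerStart])) {
    layerStart++;
  }
  const layerEnd = positionAfterNonSpace(
    layerText,
    countNonSpace(extractedText, end)
  );
  // A whitespace-only range collapses to an empty range at `layerStart`.
  return { start: layerStart, end: Math.max(layerStart, layerEnd) };
}

// Example: the text layer drops one space and doubles another.
const extracted = 'Hello world again';
const layer = 'Helloworld  again';
const range = translateRange(extracted, layer, 6, 11);
console.log(layer.slice(range.start, range.end)); // "world"
```

Each lookup scans the text from the beginning, so this is more expensive than using offsets directly, in line with the cost trade-off mentioned above. Precomputing the mapping once per page would reduce that cost.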