What is the expected behavior?
There should be 2 matches, one on the line "captions extracted from the videos", and another straddling the two lines "boxes, segmentations of objects and faces, and image cap-" and "tions are not yet available for the YFCC100M at present,". This is what you get in Chrome's built-in PDF viewer.
What went wrong?
Only the first match is found; the match straddling the two lines is missed.
Analysis
This seems to happen because pdf.js doesn't join lines as well as Chrome's PDF viewer does. If you copy the two lines using pdf.js, you get the following string, with an extraneous hyphen between "cap" and "tions":
boxes, segmentations of objects and faces, and image cap-tions are not yet available for the YFCC100M at present,
whereas if you copy the same text using Chrome's PDF viewer, you get the following string without a hyphen:
boxes, segmentations of objects and faces, and image captions are not yet available for the YFCC100M at present,
(If you search for "cap-tion" with a hyphen in pdf.js, it does match, but that's not particularly useful.)
Possible solutions
The cleanest solution would be to remove line-break hyphens in the middle of words when joining lines together, as Chrome's PDFium does. This would fix both copy/paste and Find in Document.
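To make the line-joining idea concrete, here is a minimal sketch. It assumes the extracted text is already available as an array of line strings; `joinLines` is a hypothetical helper, not an existing pdf.js function, and the heuristic (a letter followed by a trailing hyphen, then a lowercase letter starting the next line) is only one possible rule.

```javascript
// Hypothetical helper (not part of pdf.js): joins extracted text lines,
// dropping a trailing hyphen that appears to split a word across lines.
function joinLines(lines) {
  let result = "";
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    const hasNext = i + 1 < lines.length;
    const next = hasNext ? lines[i + 1].trim() : "";
    // Assumed heuristic: the hyphen follows a letter, and the next line
    // starts with a lowercase letter.
    const brokenWord = /\p{L}-$/u.test(line) && /^\p{Ll}/u.test(next);
    if (brokenWord) {
      result += line.slice(0, -1); // drop the hyphen and add no space
    } else {
      result += line + (hasNext ? " " : "");
    }
  }
  return result;
}

const joined = joinLines([
  "boxes, segmentations of objects and faces, and image cap-",
  "tions are not yet available for the YFCC100M at present,",
]);
console.log(joined.includes("captions")); // true
```

One limitation of this rule is that a genuinely hyphenated compound that happens to break at the hyphen (for example "well-" followed by "known") would also lose its hyphen, so a real implementation would probably need a more careful check.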
A hackier alternative would be to make Find in Document ignore hyphens (so "caption" would match "cap-tion"), but extraneous hyphens would still appear when you copy text that was broken across lines mid-word.
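The hyphen-insensitive search could be sketched roughly as follows. `findIgnoringHyphens` is a hypothetical helper, not the pdf.js find controller; it strips hyphens from both the page text and the query, and keeps a map back to the original offsets so that matches could still be highlighted in the right place.

```javascript
// Hypothetical helper (not the pdf.js find controller): returns the start
// offsets, in the original text, of every match of `query`, ignoring hyphens.
function findIgnoringHyphens(text, query) {
  let stripped = "";
  const positions = []; // positions[k] is the index in `text` of stripped[k]
  for (let i = 0; i < text.length; i++) {
    if (text[i] !== "-") {
      stripped += text[i];
      positions.push(i);
    }
  }
  const needle = query.replace(/-/g, "");
  const matches = [];
  if (needle.length === 0) {
    return matches;
  }
  let from = stripped.indexOf(needle);
  while (from !== -1) {
    matches.push(positions[from]);
    from = stripped.indexOf(needle, from + 1);
  }
  return matches;
}

const offsets = findIgnoringHyphens("captions and cap-tions", "caption");
console.log(offsets.length); // 2
```

This sketch is case-sensitive and ignores every hyphen, including ones that really belong to the text, which is part of why this approach is the hackier of the two.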
Additional information
Related to #4742 (Can't search across lines in some PDFs) and its PR #5783 (Improve Copy/Paste).
The PDF was generated with standard software for scientific papers: [1.5 pdfTeX-1.40.12 / LaTeX with hyperref package]
Attach PDF file here: 1503.01817v2.pdf (from https://arxiv.org/pdf/1503.01817v2.pdf)