Can't find words hyphenated across lines #11752

Closed
johnmellor opened this issue Mar 26, 2020 · 1 comment · Fixed by #13261

@johnmellor

Attach PDF file here: 1503.01817v2.pdf (from https://arxiv.org/pdf/1503.01817v2.pdf)

Configuration:

Steps to reproduce the problem:

  1. Open the attached PDF.
  2. Search for "caption".

What is the expected behavior?

There should be 2 matches, one on the line "captions extracted from the videos", and another straddling the two lines "boxes, segmentations of objects and faces, and image cap-" and "tions are not yet available for the YFCC100M at present,". This is what you get in Chrome's built-in PDF viewer.

What went wrong?

Only the first match is found, as you can see in this screenshot:

screenshot

Analysis

This seems to happen because pdf.js doesn't join lines as well as Chrome's PDF viewer. If you copy the two lines using pdf.js, you get the following string, with an extraneous dash between "cap" and "tions":

boxes, segmentations of objects and faces, and image cap-tions  are  not  yet  available  for  the  YFCC100M  at  present,

whereas if you copy the same text using Chrome's PDF viewer, you get the following string without a hyphen:

boxes, segmentations of objects and faces, and image captions are not yet available for the YFCC100M at present,

(If you search for "cap-tion" with a hyphen in pdf.js, it does match, but that's not particularly useful.)

Possible solutions

The cleanest solution would be to remove line-break hyphens in the middle of words when joining lines together, as Chrome's PDFium does. This would fix both copy/paste and Find in Document.
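
To make the idea concrete, here is a minimal sketch of hyphen-aware line joining. It is written in Python for brevity and is not pdf.js or PDFium code; the function name `join_lines` and the heuristic (a trailing hyphen after a letter, followed by a line that starts with a lowercase letter, is treated as a line-break hyphen) are assumptions made purely for illustration.

```python
def join_lines(lines):
    """Join extracted text lines, dropping hyphens that split a word.

    Hypothetical helper for illustration only. Heuristic (an assumption):
    if the text so far ends with "<letter>-" and the next line starts with
    a lowercase letter, the hyphen is a line-break hyphen and is removed.
    """
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not out:
            out = line
        elif (out.endswith("-") and len(out) >= 2
              and out[-2].isalpha() and line[0].islower()):
            # Drop the hyphen and glue the two word halves together.
            out = out[:-1] + line
        else:
            # Ordinary line break: join with a single space.
            out += " " + line
    return out


lines = [
    "boxes, segmentations of objects and faces, and image cap-",
    "tions are not yet available for the YFCC100M at present,",
]
joined = join_lines(lines)
print("caption" in joined)
```

Note that this simple heuristic would also merge genuine compounds that happen to break at their hyphen (for example "well-" followed by "known"), so a real fix would likely need a more careful rule.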

A hackier alternative would be to make Find in Document ignore hyphens (so "caption" would match "cap-tion"), but extraneous hyphens would still appear when you copy text that was broken across lines mid-word.
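
For comparison, a rough sketch of the hyphen-tolerant search, again in Python and not taken from pdf.js. The function name `find_ignoring_hyphens` is hypothetical; it builds a regular expression that allows an optional hyphen between any two characters of the query.

```python
import re


def find_ignoring_hyphens(text, query):
    """Return (start, end) spans in text where query matches, allowing
    an optional "-" between any two characters of the query.

    Hypothetical helper for illustration only.
    """
    if not query:
        return []
    # "caption" becomes the pattern c-?a-?p-?t-?i-?o-?n
    pattern = "-?".join(re.escape(ch) for ch in query)
    return [m.span() for m in re.finditer(pattern, text, re.IGNORECASE)]


text = "and image cap-tions are not yet available"
spans = find_ignoring_hyphens(text, "caption")
print([text[start:end] for start, end in spans])
```

The match still covers the hyphenated form in the underlying text, which is why this approach helps search but leaves copied text unchanged.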

Additional information

Related to #4742 (Can't search across lines in some PDFs) and its PR #5783 (Improve Copy/Paste).

The PDF was generated with standard software for scientific papers: [1.5 pdfTeX-1.40.12 / LaTeX with hyperref package]

@anuragakella

Hi, I'd like to work on this.
Can anyone mentor me through this bug?
