Skip to content

Issue with searchable PDF #1889

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
FarhadKhalafi opened this issue Aug 31, 2018 · 13 comments
Closed

Issue with searchable PDF #1889

FarhadKhalafi opened this issue Aug 31, 2018 · 13 comments
Labels

Comments

@FarhadKhalafi
Copy link

FarhadKhalafi commented Aug 31, 2018

As of 8/31/2018, if a PDF is generated from the current image without setting the Input Name, the PDF is corrupt. The code sets up all the pointers to the PDF image resource, both in the content stream and the page resources but skips storing the encoded image itself which causes the indirect reference from the image to point to the next object in the file (in my case it was the PDF info dictionary). The problem can be prevented by setting the InputName to the name of the image being processed. I think this requirement can be relaxed as follows.

The current code in pdfrenderer.cpp:

  if (!filename)
    return false;

  L_Compressed_Data *cid = nullptr;
  const int kJpegQuality = 85;

  int format, sad;
  findFileFormat(filename, &format);
  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
    Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
    sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
    pixDestroy(&p1);
  } else {
    sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
  }

And here is a simple fix (Leptonica needs either the filename or the pix):

  L_Compressed_Data *cid = nullptr;
  const int kJpegQuality = 85;

  int format, sad;
  if (filename) {
    findFileFormat(filename, &format);
    if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
      Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
      sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
      pixDestroy(&p1);
    }
  }
  if (!cid) {
    sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
  }

@jbreiden
Copy link
Contributor

This code change looks fine to me. But I am curious how you ran into this problem. The filename is set from api->GetInputName(). In what scenario is that empty?

@jbreiden jbreiden added the PDF label Aug 31, 2018
@FarhadKhalafi
Copy link
Author

My understanding from the documentation was that the InputName was optional except for certain special uses, such as in training. I had a Pix in memory which I had preprocessed with Leptonica functions and wanted to convert to a searchable PDF. I didn't know that I had to call api->SetInputName with the name of the image file (what if there is no file name).

We could probably also eliminate the first part of this logic which uses Deflate if the original image was a PNG file and let Leptonica decide on the encoding based on the color depth and the alpha channel.

@jbreiden
Copy link
Contributor

jbreiden commented Sep 1, 2018

The overall philosophy of the PDF rendering module is to be as hands off as possible with the images. The less we do, the less trouble we can get into ("You made my PDF file too big!") and the less pressure for continuous development ("Leptonica needs better transcode heuristics!" "Change the pixel values in the image!"). Forcing with Flate for PNG input is part of the philosophy of minimal image meddling. Why change that?

PS. I'd still like to encourage use of api->SetInputName whenever possible. Do you have any documentation suggestions?

@FarhadKhalafi
Copy link
Author

Thanks for all your efforts and I totally agree with your philosophy! Tesseract is about OCR and not PDF generation. A lot of libraries can take OCR data and create fancy PDF files. I will be building one since the use of zero-glyph CID font is not going to work for us. We need a switch to toggle between "image with hidden text behind" and "no image with recognized text visible". The issue I reported was not on the approach but the fact that if the input name was not detected, a corrupt PDF was being generated. Calling api->SetInputName corrects that problem. Keep up the good work!

@jbreiden
Copy link
Contributor

jbreiden commented Sep 7, 2018

@FarhadKhalafi Do you want to turn the code change in this bug into a pull request?

@zdenop
Copy link
Contributor

zdenop commented Sep 12, 2018

@jbreiden : I can commit to source code. Should I?

@FarhadKhalafi
Copy link
Author

FarhadKhalafi commented Sep 12, 2018 via email

@jbreiden
Copy link
Contributor

Yes, approved.

zdenop added a commit that referenced this issue Sep 13, 2018
@zdenop zdenop closed this as completed Sep 13, 2018
dthornley pushed a commit to picturae/tesseract that referenced this issue Sep 14, 2018
Merge branch 'master' into jpg_quality_option

* master: (577 commits)
  fix issue tesseract-ocr#1889
  Add badges for download , licence and lgtm
  Replace macro MINGW by __MINGW32__
  EquationDetectBase: Define virtual destructor in .cpp file
  BlobGrid: Define virtual destructor in .cpp file
  GridBase: Define virtual destructor in .cpp file
  AlignedBlob: Define virtual destructor in .cpp file
  TransposedArray: Define virtual destructor in .cpp file
  IndexMapBiDi: Define virtual destructor in .cpp file
  Add missing include file (fixes linker error for Visual Studio)
  NthItemTest: Add definition for virtual destructor
  HeapTest: Add definition for virtual destructor
  IcuErrorCode: Define virtual destructor in .cpp file
  Validator: Define virtual destructor in .cpp file
  Dawg: Define virtual destructor in .cpp file
  CUtil: Define virtual destructor in .cpp file
  IndexMap: Define virtual destructor in .cpp file
  CCUtil: Define virtual destructor in .cpp file
  MATRIX: Define virtual destructor in .cpp file
  CCStruct: Define virtual destructor in .cpp file
  ...
@FarhadKhalafi
Copy link
Author

FarhadKhalafi commented Sep 23, 2018

@dthornley In your commit, I believe you forgot to remove (at the beginning of the module)

if (!filename)
return false;

which must be removed for the fix to work.

@zdenop
Copy link
Contributor

zdenop commented Sep 25, 2018

@FarhadKhalafi :

IMO your proposed change is not safe because l_generateCIDataForPdf needs filename or pix. So we should change it this way:

  if (!filename && !pix)
    return false;

Can you test it please?

@zdenop zdenop reopened this Sep 25, 2018
@FarhadKhalafi
Copy link
Author

@zdenop I think the idea of using a file name provided by the caller independent of the Pix image is a problem. The user can pass any filename (or none) and break this code. Is it poaaible to use Leptonica function pixGetInputFormat and get rid of the filename? Something like:

  if (!pix)
    return false;

  L_Compressed_Data *cid = nullptr;
  const int kJpegQuality = 85;

  if (pixGetSpp(pix) == 4 && pixGetInputFormat(pix) == IFF_PNG) {
      Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
      sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
      pixDestroy(&p1);
    }
  }
  if (!cid) {
    sad = l_generateCIDataForPdf(NULL, pix, kJpegQuality, &cid);
  }

I believe the reason Leptonica function l_generateCIDataForPdf accepts a filename as well as a pix is for optimization as certain image files can be directly inserted into PDF without decompressing first (e.g. Tiff or Jpeg).

If this doesn't work, then I would agree with your proposed check (or just checking for pix, as filename is checked later in the code). The same check as your suggestion is also done inside l_generateCIDataForPdf, but it is OK to check multiple times.

@zdenop
Copy link
Contributor

zdenop commented Sep 25, 2018

The code was working fine so I do not see reason why to eliminate usage of filename.
At least not in this situation when we are so close to release of new stable release.

@FarhadKhalafi
Copy link
Author

I am OK with your comment. Originally I got into this situation when I was not setting the filename and the code was generating a corrupt PDF where the indirect reference to the image stream was pointing to the PDF info stream instead without complaining. Now that I know about the filename, I can set it prior to calling PDF renderer and everything is OK. Sorry if I created a diversion and thank you for all your efforts!

@zdenop zdenop closed this as completed in 5dfce74 Sep 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants