Issue with searchable PDF #1889

FarhadKhalafi · 2018-08-31T19:08:12Z

As of 8/31/2018, if a PDF is generated from the current image without setting the input name, the PDF is corrupt. The code sets up all the references to the PDF image resource, both in the content stream and in the page resources, but skips storing the encoded image itself. As a result, the indirect reference to the image points to the next object in the file (in my case it was the PDF info dictionary). The problem can be prevented by setting the InputName to the name of the image being processed. I think this requirement can be relaxed as follows.

The current code in pdfrenderer.cpp:

  if (!filename)
    return false;

  L_Compressed_Data *cid = nullptr;
  const int kJpegQuality = 85;

  int format, sad;
  findFileFormat(filename, &format);
  if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
    Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
    sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
    pixDestroy(&p1);
  } else {
    sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
  }

And here is a simple fix (Leptonica needs either the filename or the pix):

  L_Compressed_Data *cid = nullptr;
  const int kJpegQuality = 85;

  int format, sad;
  if (filename) {
    findFileFormat(filename, &format);
    if (pixGetSpp(pix) == 4 && format == IFF_PNG) {
      Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
      sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
      pixDestroy(&p1);
    }
  }
  if (!cid) {
    sad = l_generateCIDataForPdf(filename, pix, kJpegQuality, &cid);
  }
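
To illustrate why the missing image makes the reference land on the info dictionary, here is a minimal sketch of sequential object numbering. It is a toy model, not the Tesseract code: `ToyPdf` and `ResolvedImageKind` are hypothetical names, and the real file layout contains more objects than shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a PDF writer: objects are numbered in the order
// they are appended, starting at 1.
class ToyPdf {
 public:
  // Appends an object and returns the number it received.
  std::size_t Append(const std::string &kind) {
    objects_.push_back(kind);
    return objects_.size();
  }
  // Number that the next appended object will receive.
  std::size_t NextNumber() const { return objects_.size() + 1; }
  // Kind of object that an indirect reference "N 0 R" resolves to.
  const std::string &Resolve(std::size_t number) const {
    assert(number >= 1 && number <= objects_.size());
    return objects_[number - 1];
  }

 private:
  std::vector<std::string> objects_;
};

// Writes a catalog, one page, optionally the image, and the info
// dictionary, then reports what the page's image reference resolves to.
std::string ResolvedImageKind(bool store_image) {
  ToyPdf pdf;
  pdf.Append("catalog");
  // The page is written first and already names the object number
  // that the image is expected to receive (the one right after it).
  const std::size_t image_ref = pdf.NextNumber() + 1;
  pdf.Append("page");
  if (store_image) pdf.Append("image");
  pdf.Append("info");
  return pdf.Resolve(image_ref);
}
```

With `store_image` set to false, the info dictionary takes the number reserved for the image, which matches the corruption described above.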

jbreiden · 2018-08-31T19:29:29Z

This code change looks fine to me. But I am curious how you ran into this problem. The filename is set from api->GetInputName(). In what scenario is that empty?

FarhadKhalafi · 2018-08-31T20:00:33Z

My understanding from the documentation was that the InputName was optional except for certain special uses, such as in training. I had a Pix in memory which I had preprocessed with Leptonica functions and wanted to convert to a searchable PDF. I didn't know that I had to call api->SetInputName with the name of the image file (what if there is no file name?).

We could probably also eliminate the first part of this logic which uses Deflate if the original image was a PNG file and let Leptonica decide on the encoding based on the color depth and the alpha channel.

jbreiden · 2018-09-01T19:40:27Z

The overall philosophy of the PDF rendering module is to be as hands off as possible with the images. The less we do, the less trouble we can get into ("You made my PDF file too big!") and the less pressure for continuous development ("Leptonica needs better transcode heuristics!" "Change the pixel values in the image!"). Forcing with Flate for PNG input is part of the philosophy of minimal image meddling. Why change that?

PS. I'd still like to encourage use of api->SetInputName whenever possible. Do you have any documentation suggestions?

FarhadKhalafi · 2018-09-01T21:04:02Z

Thanks for all your efforts and I totally agree with your philosophy! Tesseract is about OCR and not PDF generation. A lot of libraries can take OCR data and create fancy PDF files. I will be building one since the use of zero-glyph CID font is not going to work for us. We need a switch to toggle between "image with hidden text behind" and "no image with recognized text visible". The issue I reported was not on the approach but the fact that if the input name was not detected, a corrupt PDF was being generated. Calling api->SetInputName corrects that problem. Keep up the good work!

jbreiden · 2018-09-07T01:21:49Z

@FarhadKhalafi Do you want to turn the code change in this bug into a pull request?

zdenop · 2018-09-12T17:02:45Z

@jbreiden : I can commit to source code. Should I?

FarhadKhalafi · 2018-09-12T17:19:36Z

Yes please, if you could. My workload is crazy right now and I am not that familiar with pull requests. Thanks!

jbreiden · 2018-09-12T22:53:38Z

Yes, approved.

Referenced by a commit: Merge branch 'master' into jpg_quality_option (577 commits from master, including "fix issue tesseract-ocr#1889").

FarhadKhalafi · 2018-09-23T17:54:42Z

@dthornley In your commit, I believe you forgot to remove (at the beginning of the module)

  if (!filename)
    return false;

which must be removed for the fix to work.

zdenop · 2018-09-25T11:16:38Z

@FarhadKhalafi :

IMO your proposed change is not safe because l_generateCIDataForPdf needs filename or pix. So we should change it this way:

  if (!filename && !pix)
    return false;

Can you test it please?
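
For illustration, the guard above combined with the earlier proposed fix can be sketched with stand-in types so that it runs without Leptonica. This is only a sketch under assumptions: `ToyPix`, `Encoder`, `LooksLikePng`, and `ChooseEncoder` are hypothetical names, and the extension check replaces the real `findFileFormat` call, which as far as I know inspects the file contents rather than the name.

```cpp
#include <string>

// Stand-in for the one Pix property the decision depends on.
struct ToyPix {
  int samples_per_pixel;  // 4 means an alpha channel is present
};

enum class Encoder { kNone, kFlate, kLeptonicaChoice };

// Hypothetical helper: true if the name ends in ".png".
bool LooksLikePng(const std::string &name) {
  const std::string ext = ".png";
  return name.size() >= ext.size() &&
         name.compare(name.size() - ext.size(), ext.size(), ext) == 0;
}

// Mirrors the control flow under discussion: bail out only when there
// is neither a filename nor a pix, force Flate for PNG input with
// alpha, and otherwise leave the choice to the library.
Encoder ChooseEncoder(const char *filename, const ToyPix *pix) {
  if (!filename && !pix) return Encoder::kNone;
  if (filename && pix && pix->samples_per_pixel == 4 &&
      LooksLikePng(filename)) {
    return Encoder::kFlate;
  }
  return Encoder::kLeptonicaChoice;
}
```

In this model an in-memory image with no filename still gets encoded, which is the case that previously produced the corrupt PDF.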

FarhadKhalafi · 2018-09-25T14:22:38Z

@zdenop I think the idea of using a file name provided by the caller, independent of the Pix image, is a problem. The user can pass any filename (or none) and break this code. Is it possible to use the Leptonica function pixGetInputFormat and get rid of the filename? Something like:

  if (!pix)
    return false;

  L_Compressed_Data *cid = nullptr;
  const int kJpegQuality = 85;

  int sad;
  if (pixGetSpp(pix) == 4 && pixGetInputFormat(pix) == IFF_PNG) {
    Pix *p1 = pixAlphaBlendUniform(pix, 0xffffff00);
    sad = pixGenerateCIData(p1, L_FLATE_ENCODE, 0, 0, &cid);
    pixDestroy(&p1);
  }
  if (!cid) {
    sad = l_generateCIDataForPdf(nullptr, pix, kJpegQuality, &cid);
  }

I believe the reason the Leptonica function l_generateCIDataForPdf accepts a filename as well as a pix is optimization, as certain image files can be inserted directly into a PDF without decompressing first (e.g. TIFF or JPEG).

If this doesn't work, then I would agree with your proposed check (or just checking for pix, as filename is checked later in the code). The same check as your suggestion is also done inside l_generateCIDataForPdf, but it is OK to check multiple times.

zdenop · 2018-09-25T18:11:28Z

The code was working fine, so I do not see a reason to eliminate the use of filename.
At least not now, when we are so close to a new stable release.

FarhadKhalafi · 2018-09-25T18:40:30Z

I am OK with your comment. Originally I got into this situation when I was not setting the filename and the code was generating a corrupt PDF where the indirect reference to the image stream was pointing to the PDF info stream instead without complaining. Now that I know about the filename, I can set it prior to calling PDF renderer and everything is OK. Sorry if I created a diversion and thank you for all your efforts!

jbreiden added the PDF label Aug 31, 2018

zdenop added a commit that referenced this issue Sep 13, 2018

fix issue #1889

59e42fc

zdenop closed this as completed Sep 13, 2018

FarhadKhalafi mentioned this issue Sep 23, 2018

Issue with searchable PDF #1889 - Reopened #1928

Closed

zdenop reopened this Sep 25, 2018

zdenop closed this as completed in 5dfce74 Sep 27, 2018
